Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently
On Nov 9, 2017, at 21:49, Brian Cole wrote:
> Certainly, but thousands of lines of Python doesn't fit in an email in an
> easily digestible way. :-)

I'll restate things since I wasn't clear. While this step may be what you need for the way you structure things, there might be a better way to structure things. I can't tell because I don't know what it is you are trying to do.

> The reason I need to drop into a real RDKit molecule is because I want to be
> able to attach to any implicit hydrogen for my application. I couldn't think
> of an easy regular expression that located an atom block with one or more
> implicit hydrogens.

There isn't one. That requires at least a context-free grammar because it needs to count the valence used by branches, and branches can be arbitrarily nested.

I think your "any implicit hydrogen" will have problems when the implicit hydrogen count is specified in square brackets, as with a chiral hydrogen, or an atom outside of the organic subset, or one with another property specified (e.g., isotopes or charge).

Leaving the tricky chiral hydrogen aside, you're turning:

  [C@]([*:9])1(C)C[SiH2]O1

where the silicon has an implicit hydrogen count of 2 and a valence of 4, into

  C[C@]19C[SiH2]8O1

where the silicon is now 5-valent. Similarly,

  [B-]1OCC[NH+]1

becomes

  [B-]1OCC[NH+]18

If you have some way to annotate which atoms have at least one implicit hydrogen then you can use the regular expression from my last email, and if it uses []s then reach in and reduce the H count by 1 as part of the transformation.

You'll still need some special code to deal with chiral hydrogens.

BTW, I don't think you need ring closures for this at all. You have a set of fragments, where you know which atom will be attached, and I believe you control the ordering of the atoms in that fragment.

If you use MolToSmiles(rootedAtAtom) so that the attachment atom is always first, e.g. place the "O" in the phenol first:

  Oc1c1

then you can attach it to the core at a given point using a branch, e.g., to attach it to the 4th atom of:

  [C@]1(C)CCO1
    ->
  [C@]1(C)CC(Oc1c1)O1

This is the 4th regular expression match (to find the atom, and skip any ring closures), followed by '(' followed by the rooted fragment followed by ')' followed by the rest of the original string. Plus some modification of the regular expression match itself to reduce the H count, if the implicit H-count is stated explicitly. This assumes the core attachment point has at least one implicit, non-chiral hydrogen on it.

It feels slightly less tricky than the ring closure solution, though still tricky.

Andrew
da...@dalkescientific.com

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
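The branch-based attachment described in this message can be sketched with the standard library alone. This is only an illustration under stated assumptions: `attach_branch`, `CLOSURES` and `HCOUNT` are hypothetical names, the caller already knows (for example from a toolkit) that the chosen atom has a replaceable hydrogen, chiral bracket atoms are refused rather than handled, and ring-closure numbers in the fragment are not renumbered, so they can clash with closures that are still open in the core.

```python
import re

# Atom term: organic-subset symbol or a bracket atom (pattern from this thread).
ATOM = re.compile(r"Cl?|Br?|[NOSPFIbcnosp*]|\[[^]]*\]")
# Ring-closure labels directly after an atom: optional bond symbol, then digit or %nn.
CLOSURES = re.compile(r"(?:[-=#$:/\\]?(?:%\d\d|\d))*")
# Explicit hydrogen count inside a bracket atom, e.g. the "H2" in [SiH2].
HCOUNT = re.compile(r"(?<=[A-Za-z@])H(\d*)")


def attach_branch(core, atom_index, rooted_fragment):
    """Attach rooted_fragment as a branch on the atom_index-th atom term of core."""
    m = list(ATOM.finditer(core))[atom_index]
    atom = m.group()
    if atom.startswith("["):
        if "@" in atom:
            # Chiral hydrogens need extra care and are out of scope for this sketch.
            raise NotImplementedError("chiral bracket atoms are not handled")
        h = HCOUNT.search(atom)
        count = int(h.group(1) or 1) if h else 0
        if count < 1:
            raise ValueError("bracket atom %r has no hydrogen to replace" % atom)
        # Reduce the stated hydrogen count by one: H2 -> H, H -> (nothing), H3 -> H2.
        new_h = {1: "", 2: "H"}.get(count, "H%d" % (count - 1))
        atom = atom[:h.start()] + new_h + atom[h.end():]
    # Skip ring closures that belong to this atom, then open the branch.
    end = CLOSURES.match(core, m.end()).end()
    return (core[:m.start()] + atom + core[m.end():end]
            + "(" + rooted_fragment + ")" + core[end:])


print(attach_branch("[C@]1(C)CCO1", 3, "F"))
print(attach_branch("C[SiH2]O", 1, "F"))
print(attach_branch("[B-]1OCC[NH+]1", 4, "F"))
```

The fragment here is a single fluorine to keep the example free of ring-closure clashes; a real fragment with its own ring digits would need renumbering first.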
Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently
> Somehow you got the code to generate a "9" for that ring closure, which is
> not something that RDKit does naturally, so we are only seeing a step in
> the larger part of your goal.

Certainly, but thousands of lines of Python doesn't fit in an email in an easily digestible way. :-)

> Since you are already comfortable manipulating the SMILES string directly,
> a faster solution is to bypass the toolkit and manipulate the SMILES
> directly, as in:
>
> import re
>
> # Match the SMILES for an atom, followed by its closures
> atom_pattern = re.compile(r"""
> (
>  Cl? |              # Cl and Br are part of the organic subset
>  Br? |
>  [NOSPFIbcnosp*] |  # as are these single-letter elements
>  \[[^]]*\]          # everything else must be in []s
> )
> """, re.X)
>
> smiles = 'F9.[C@]91(C)CCO1'
> fluorine, core = smiles.split('.')
> matches = list(atom_pattern.finditer(core))
> m = matches[3]
> new_core = core[:m.end()] + "8" + core[m.end():]
> print(new_core)

The reason I need to drop into a real RDKit molecule is because I want to be able to attach to any implicit hydrogen for my application. I couldn't think of an easy regular expression that located an atom block with one or more implicit hydrogens. So I drop into an RDKit molecule for that part to figure out where the possible hydrogens are for me to replace with a functional group.

> Also, this:
>
> >>> mol.AddBond(idx, 4, Chem.rdchem.BondType.SINGLE)
>
> is a piece of magic. Where does the 4 come from? RDKit doesn't guarantee
> that the nth atom term in the input SMILES is the same as the nth
> identifier. It's close, but, for example, explicit '[H]' atoms are usually
> turned into implicit hydrogen counts.

Hence the reason I use this to actually parse the SMILES:

    def MolFromSmilesWithHydrogen(smiles):
        params = Chem.rdmolfiles.SmilesParserParams()
        params.removeHs = False
        return Chem.MolFromSmiles(smiles, params)

Even so, in the actual application the atom indices do refer to an actual RDKit molecule that has been scanned for implicit hydrogen locations. I was just trying to keep it 'email simple'.

> > I've written code in the past to do this kind of thing for virtual
> > library building, using dummy atoms to mark link positions in the
> > fragments, and using Perl code to transform between the dummy atoms
> > and bond-closure numbers to give text strings which could be assembled
> > to give valid dot-disconnected SMILES. This required additional
> > lexical transformations in order to maintain valid SMILES depending on
> > where the dummy atom was, and to make sure that stereochemistry worked
> > properly. If you want to do this kind of thing I don't think you can
> > expect to avoid these additional lexical operations.
>
> This is exactly what mmpdb does, although in Python code. If anyone is
> interested, see
> https://github.com/rdkit/mmpdb/blob/master/mmpdblib/smiles_syntax.py .

And I've totally stolen your idea and run with it over the past year or so. :-) I'm hoping I can talk about it and maybe even open-source it sometime. I want to hook it up to mmpdb if I can as well.

Cheers,
Brian
Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently
On Nov 9, 2017, at 16:09, Brian Cole wrote:
> Here's an example of why this is useful at maintaining molecular
> fragmentation inside your molecular representation:
>
> >>> from rdkit import Chem
> >>> smiles = 'F9.[C@]91(C)CCO1'
> >>> fluorine, core = smiles.split('.')
> >>> fluorine
> 'F9'
> >>> fragment = core.replace('9', '([*:9])')

Somehow you got the code to generate a "9" for that ring closure, which is not something that RDKit does naturally, so we are only seeing a step in the larger part of your goal.

The step you gave does a number of transformations to convert:

  [C@]91(C)CCO1

so the 4th atom has an '8' as an attachment point, that is:

  [C@]91(C)CC8O1

Since you are already comfortable manipulating the SMILES string directly, a faster solution is to bypass the toolkit and manipulate the SMILES directly, as in:

  import re

  # Match the SMILES for an atom, followed by its closures
  atom_pattern = re.compile(r"""
  (
   Cl? |              # Cl and Br are part of the organic subset
   Br? |
   [NOSPFIbcnosp*] |  # as are these single-letter elements
   \[[^]]*\]          # everything else must be in []s
  )
  """, re.X)

  smiles = 'F9.[C@]91(C)CCO1'
  fluorine, core = smiles.split('.')
  matches = list(atom_pattern.finditer(core))
  m = matches[3]
  new_core = core[:m.end()] + "8" + core[m.end():]
  print(new_core)

Also, this:

  >>> mol.AddBond(idx, 4, Chem.rdchem.BondType.SINGLE)

is a piece of magic. Where does the 4 come from? RDKit doesn't guarantee that the nth atom term in the input SMILES is the same as the nth identifier. It's close, but, for example, explicit '[H]' atoms are usually turned into implicit hydrogen counts.

Finally, there's another assumption in:

  >>> new_core = new_core.replace('([*:9])', '9').replace('([*:8])', '8')

Sometimes the result will not be inside of ()s. For example, the same transformation on:

  F9.[C@]91(C)C(C)O1

produces a new_core of:

  C[C@@]19OC1C[*:8]

when you want it to produce:

  C[C@@]19OC1C8

For what it's worth, the re-based version generates:

  [C@]91(C)C(C8)O1

On Nov 9, 2017, at 16:27, Chris Earnshaw wrote:
> Trouble is, you're mixing chemical operations and lexical ones.

Agreed.

> I've written code in the past to do this kind of thing for virtual
> library building, using dummy atoms to mark link positions in the
> fragments, and using Perl code to transform between the dummy atoms
> and bond-closure numbers to give text strings which could be assembled
> to give valid dot-disconnected SMILES. This required additional
> lexical transformations in order to maintain valid SMILES depending on
> where the dummy atom was, and to make sure that stereochemistry worked
> properly. If you want to do this kind of thing I don't think you can
> expect to avoid these additional lexical operations.

This is exactly what mmpdb does, although in Python code. If anyone is interested, see https://github.com/rdkit/mmpdb/blob/master/mmpdblib/smiles_syntax.py .

Cheers,

Andrew
da...@dalkescientific.com
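The point about the dummy atom not always appearing inside ()s can be illustrated with a small regular-expression sketch. It is an assumption-laden illustration, not the mmpdb implementation: `dummies_to_closures` and `DUMMY` are hypothetical names, only map numbers 1 to 9 are handled, and a dummy atom that starts the string or is preceded by a bond symbol is left untouched.

```python
import re

# Either a whole-branch dummy "([*:8])", or a bare "[*:8]" that ends the
# string, ends a branch, or ends a dot-disconnected component.
DUMMY = re.compile(r"\(\[\*:([1-9])\]\)|\[\*:([1-9])\](?=[).]|$)")


def dummies_to_closures(smiles):
    """Replace mapped dummy atoms with the ring-closure digit of the same number."""
    return DUMMY.sub(lambda m: m.group(1) or m.group(2), smiles)


print(dummies_to_closures("C[C@@]19OC1C[*:8]"))
print(dummies_to_closures("C[C@]1([*:9])CC([*:8])O1"))
```

Note that a purely lexical replacement like this can place a closure digit directly after a ')' (for example when the dummy branch follows another branch), which is the non-strict form discussed elsewhere in this thread.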
Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently
Trouble is, you're mixing chemical operations and lexical ones. It might be handy if this 'just worked' but in practice it's not going to produce valid SMILES without more work.

I've written code in the past to do this kind of thing for virtual library building, using dummy atoms to mark link positions in the fragments, and using Perl code to transform between the dummy atoms and bond-closure numbers to give text strings which could be assembled to give valid dot-disconnected SMILES. This required additional lexical transformations in order to maintain valid SMILES depending on where the dummy atom was, and to make sure that stereochemistry worked properly. If you want to do this kind of thing I don't think you can expect to avoid these additional lexical operations.

I don't think it's reasonable to expect that invalid SMILES strings should be coerced into giving a particular result for convenience when
1) they're invalid! and
2) the behaviour is actually a reasonable interpretation of the order of connections in the SMILES (even though they are invalid).

I don't think the current RDKit interpretation of these SMILES should change, though it might be useful if it could issue a warning that SMILES of this type are not correct.
Best regards,
Chris
Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently
Here's an example of why this is useful for maintaining molecular fragmentation inside your molecular representation:

>>> from rdkit import Chem
>>> smiles = 'F9.[C@]91(C)CCO1'
>>> fluorine, core = smiles.split('.')
>>> fluorine
'F9'
>>> fragment = core.replace('9', '([*:9])')
>>> fragment
'[C@]([*:9])1(C)CCO1'
>>> mol = Chem.RWMol(Chem.MolFromSmiles(fragment))  ### RDKit is flipping the stereo on me here even though the order of the bonds has not changed
>>> idx = mol.AddAtom(Chem.Atom(0))
>>> mol.AddBond(idx, 4, Chem.rdchem.BondType.SINGLE)
7
>>> mol.GetAtomWithIdx(idx).SetIntProp("molAtomMapNumber", 8)
>>> new_core = Chem.MolToSmiles(mol, True)
>>> new_core = new_core.replace('([*:9])', '9').replace('([*:8])', '8')
>>> new_core
'C[C@]19CC8O1'
>>> analog_smiles = 'Cl8.' + fluorine + '.' + new_core
>>> analog_smiles
'Cl8.F9.C[C@]19CC8O1'
>>> analog = Chem.MolFromSmiles(analog_smiles)
>>> analog.HasSubstructMatch(Chem.MolFromSmiles(smiles), useChirality=True)  # Uh oh! My original molecule didn't match
False
>>> analog.HasSubstructMatch(Chem.MolFromSmiles(smiles.replace('@', '@@')), useChirality=True)  # flipping the stereo of the original causes it to match again
True
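One detail of the session above: `core.replace('9', '([*:9])')` rewrites every '9' in the string, including digits inside bracket atoms (an isotope such as `[19F]`) or inside a two-digit `%nn` closure. A minimal token-aware sketch, using only the standard library, is shown here; `closure_to_dummy` and `TOKEN` are hypothetical names and only single-digit labels are considered.

```python
import re

# Tokens: a whole bracket atom, a two-digit %nn closure, or any single character.
TOKEN = re.compile(r"\[[^]]*\]|%\d\d|.", re.S)


def closure_to_dummy(smiles, label):
    """Replace the ring-closure digit `label` with a mapped dummy-atom branch."""
    out = []
    for tok in TOKEN.findall(smiles):
        # Bracket atoms and %nn closures are single tokens, so digits inside
        # them never compare equal to the one-character label.
        out.append("([*:%s])" % label if tok == label else tok)
    return "".join(out)


print(closure_to_dummy("[C@]91(C)CCO1", "9"))
print(closure_to_dummy("[19F]C9", "9"))
```

The first call reproduces the `fragment` string from the session, which is itself the branch-before-ring-closure form whose stereo interpretation this thread is about.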
Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently
On Nov 9, 2017, at 08:13, Greg Landrum wrote:
> As was discussed in the comments of
> https://github.com/rdkit/rdkit/issues/786, I think it's pretty gross that the
> second syntax is even legal. But that's a side point.

To belabor that point: neither Daylight SMILES nor OpenSMILES accepts it, and those are the only two explicit sources of "legal" that people use.

"Allowed" might be a better term.

Andrew
da...@dalkescientific.com
Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently
Hi,

Surely the problem is that some of these SMILES aren't really valid. From the Daylight theory manual:

'*The bonds are numbered in any order, designating ring opening (or ring closure) bonds by a digit immediately following the atomic symbol at each ring closure*' (my emphasis).

So the behaviour with SMILES where there is an atom between the ring closure digit and the atom to which the ring closure applies (e.g. [C@@](F)1(C)CCO1) may well not be well defined. Arguably RDKit should refuse to process these, but apparently it looks at the atom order and inverts the stereochemistry instead.

In Daylight SMILES the @ symbol refers to the order of substituents around the asymmetric atom. If we swap the ring closure digit and one of the atoms then we've changed the order of connections and inverted the stereochemistry, so the current behaviour seems reasonable.

Personally I wouldn't change the behaviour, or at most I would get RDKit to issue a warning that the SMILES isn't 'strict' in these cases. I think the safest approach is to stick to SMILES which are unequivocally valid, unless RDKit is going to create its own definition of SMILES...

Best regards,
Chris Earnshaw
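The Daylight wording quoted in this message suggests a rough lexical check for the 'non-strict' form: a ring-closure label that directly follows a ')' instead of an atom or another label. The sketch below is a hypothetical helper (`closures_after_branch` is not an RDKit function) and is only a lexical scan, not a SMILES parser; it could be the basis of the kind of warning suggested above.

```python
def closures_after_branch(smiles):
    """Return the string indices of ring-closure labels that follow a ')'."""
    hits = []
    after_close = False
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip the whole bracket atom so its digits are not mistaken for labels.
            i = smiles.index("]", i) + 1
            after_close = False
        elif ch.isdigit() or ch == "%":
            if after_close:
                hits.append(i)
            i += 3 if ch == "%" else 1  # %nn labels are three characters long
        else:
            if ch == ")":
                after_close = True
            elif ch not in "-=#$:/\\":
                # Anything except a bond symbol ends the "just closed a branch" state.
                after_close = False
            i += 1
    return hits


print(closures_after_branch("F[C@@]1(C)CCO1"))
print(closures_after_branch("[C@@](F)1(C)CCO1"))
```

An empty list means every ring-closure label immediately follows its atom (possibly after a bond symbol or another label).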
Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently
On Thu, Nov 9, 2017 at 6:32 AM, Brian Cole wrote:

> Hi Cheminformaticians,
>
> This is an extreme subtlety in the interpretation of SMILES atom
> stereochemistry and I think a bug in RDKit. Specifically, I think the
> following SMILES should be the same molecule:
>
> >>> rdkit.__version__
> '2017.09.1'
> >>> Chem.CanonSmiles('F[C@@]1(C)CCO1')
> 'C[C@]1(F)CCO1'
> >>> Chem.CanonSmiles('[C@@](F)1(C)CCO1')
> 'C[C@@]1(F)CCO1'

As was discussed in the comments of https://github.com/rdkit/rdkit/issues/786, I think it's pretty gross that the second syntax is even legal. But that's a side point.

> Since there is no hydrogen inside the stereo carbon atom block the bond
> being 'looked down' should be the first atom encountered. In both cases
> above, that should be the fluorine, therefore the molecules should be
> equivalent.

Agreed, and this is a view that's further supported by this behavior:

In [2]: Chem.CanonSmiles('F[C@@]1(C)CCO1')
Out[2]: 'C[C@]1(F)CCO1'

In [3]: Chem.CanonSmiles('F[C@@](C)1CCO1')
Out[3]: 'C[C@@]1(F)CCO1'

Would you mind filing a bug for this and I'll try to track it down/fix it?

Thanks,
-greg
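The reasoning behind the two IPython results can be checked with a small permutation-parity sketch. It assumes the usual reading that neighbours of a stereo centre are ordered as their bonds appear in the string, with a ring-closure digit counting at the position where it is written. The neighbour labels ('F', 'O', 'Me', 'CH2') and the function names are hypothetical and chosen only for this illustration.

```python
def permutation_is_odd(order_a, order_b):
    """True if order_b is an odd permutation of order_a (labels must be unique)."""
    index = {label: i for i, label in enumerate(order_a)}
    perm = [index[label] for label in order_b]
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2 == 1


def same_chirality(mark_a, order_a, mark_b, order_b):
    """Compare two @/@@ marks written against two neighbour orderings."""
    # An odd reordering of the neighbours swaps the meaning of @ and @@.
    return (mark_a == mark_b) != permutation_is_odd(order_a, order_b)


# F[C@@]1(C)CCO1 : F, ring closure 1 (to O), methyl, ring CH2
first = ["F", "O", "Me", "CH2"]
# F[C@@](C)1CCO1 : F, methyl, ring closure 1 (to O), ring CH2
second = ["F", "Me", "O", "CH2"]
# [C@@](F)1(C)CCO1 : F, ring closure 1 (to O), methyl, ring CH2
third = ["F", "O", "Me", "CH2"]

print(same_chirality("@@", first, "@@", second))
print(same_chirality("@@", first, "@@", third))
```

Under this reading, moving the digit past the methyl branch is a single swap and so changes the molecule, while moving the fluorine from before the atom into a leading branch leaves the neighbour order unchanged.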
[Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently
Hi Cheminformaticians,

This is an extreme subtlety in the interpretation of SMILES atom stereochemistry and I think a bug in RDKit. Specifically, I think the following SMILES should be the same molecule:

>>> rdkit.__version__
'2017.09.1'
>>> Chem.CanonSmiles('F[C@@]1(C)CCO1')
'C[C@]1(F)CCO1'
>>> Chem.CanonSmiles('[C@@](F)1(C)CCO1')
'C[C@@]1(F)CCO1'

Since there is no hydrogen inside the stereo carbon atom block the bond being 'looked down' should be the first atom encountered. In both cases above, that should be the fluorine, therefore the molecules should be equivalent.

Though it could be argued the 2nd one is not strict SMILES as Andrew describes here: https://github.com/rdkit/rdkit/issues/786

It is useful when recombining fragments with ring closure digits for these to be equivalent:

[*][C@]1(C)CCO1
[C@]([*])1(C)CCO1

Also, every other tool I can get my hands on agrees they're the same: OEChem, OpenBabel, indigo, and ChemAxon. (CDK lacks a simple enough canonicalization example for me to work from.)

Sure wish there was a SMILES validation test suite we could all run against. And so I'm attaching the examples I used to verify the above so whatever poor soul is assigned that task later can find this on Google. (I'm hopeful :-)

Thanks,
Brian

PS: the current output from the script:

$ python stereo_handling_first_atom.py
RDKit = 2017.09.1
OEChem = 2.1.2
OpenBabel = 2.4.1
indigo = 1.2.3.r0-g98188eb mac10.7
RDKit failed to recognize these as the same:
    [*:1][C@]1([*:2])CC1(Cl)Cl -> ClC1(Cl)C[C@]1([*:1])[*:2]
    [C@]([*:1])1([*:2])CC1(Cl)Cl -> ClC1(Cl)C[C@@]1([*:1])[*:2]
OpenBabel failed to recognize these as the same:
    Cl[S@](C)=O -> C[S@](=O)Cl
    [S@](Cl)(C)=O -> C[S@@](=O)Cl
Indigo failed to recognize these as the same:
    Cl[S@](C)=O -> C[S@](=O)Cl
    [S@](Cl)(C)=O -> C[S@@](=O)Cl
OpenBabel failed to recognize these as the same:
    Cl[S@](C)= -> =[S@](Cl)C
    [S@](Cl)(C)= -> =[S@@](Cl)C
Indigo failed to recognize these as the same:
    Cl[S@](C)= -> =[S@@](C)Cl
    [S@](Cl)(C)= -> =[S@](C)Cl
RDKit failed to recognize these as the same:
    Cl[C@](F)1CC[C@H](F)CC1 -> F[C@H]1CC[C@](F)(Cl)CC1
    [C@](Cl)(F)1CC[C@H](F)CC1 -> F[C@H]1CC[C@@](F)(Cl)CC1
RDKit failed to recognize these as the same:
    Cl[C@]1(c2c2)NCCCS1 -> Cl[C@]1(c2c2)NCCCS1
    [C@](Cl)1(c2c2)NCCCS1 -> Cl[C@@]1(c2c2)NCCCS1
RDKit failed to recognize these as the same:
    Cl3.[C@]31(c2c2)NCCCS1 -> Cl[C@]1(c2c2)NCCCS1
    [C@](Cl)1(c2c2)NCCCS1 -> Cl[C@@]1(c2c2)NCCCS1
RDKit failed to recognize these as the same:
    Cl[C@](F)1C2C(C1)CNC2 -> F[C@@]1(Cl)CC2CNCC21
    [C@](Cl)(F)1C2C(C1)CNC2 -> F[C@]1(Cl)CC2CNCC21
RDKit failed to recognize these as the same:
    [*][C@@H]1CO1 -> [*][C@@H]1CO1
    [C@H]([*])1CO1 -> [*][C@H]1CO1
RDKit failed to recognize these as the same:
    [*][C@@]1(C)CCO1 -> [*][C@@]1(C)CCO1
    [C@@]([*])1(C)CCO1 -> [*][C@]1(C)CCO1
RDKit failed to recognize these as the same:
    F[C@@]1(C)CCO1 -> C[C@]1(F)CCO1
    [C@@](F)1(C)CCO1 -> C[C@@]1(F)CCO1
RDKit failed to recognize these as the same:
    Cl[C@@H]1[C@@H](Cl)C(Cl)CCN1 -> ClC1CCN[C@H](Cl)[C@H]1Cl
    [C@H](Cl)1[C@@H](Cl)C(Cl)CCN1 -> ClC1CCN[C@@H](Cl)[C@H]1Cl

from __future__ import print_function
import sys
sys.path.append('/Users/coleb/indigo-python')

import rdkit
from rdkit import Chem
from openeye.oechem import *
import pybel
import openbabel
import indigo

print("RDKit =", rdkit.__version__)
print("OEChem =", OEChemGetRelease())
print("OpenBabel =", openbabel.OBReleaseVersion())
print("indigo =", indigo.Indigo().version())

def indigoCanSmi(smi):
    return indigo.Indigo().loadMolecule(smi).canonicalSmiles()

def OBCanSmi(smi):
    return pybel.readstring("smi", smi).write("can").strip()

def OECanSmi(smi):
    mol = OEGraphMol()
    OESmilesToMol(mol, smi)
    return OEMolToSmiles(mol)

SMILES_THAT_ARE_THE_SAME = [
    # examples from OpenSMILES spec, every toolkit agrees on these at least
    ('N[C@H](O)C', '[C@@H](N)(O)C'),
    ('N[C@@](Br)(C)O', 'N[C@](Br)(O)C'),
    ('Br[C@](N)(C)O', 'O[C@](Br)(C)N'),
    # examples with attachment points every toolkit agrees on
    ('[*][C@@H](C)N', '[C@H]([*])(C)N'),
    ('[*][C@@](F)(C)N', '[C@@]([*])(F)(C)N'),
    # examples from Dalke 2017 UGM talk
    ('[*][C@](N)(O)S', '[C@]([*])(N)(O)S'),
    ('[*][C@H](O)S', '[C@@H]([*])(O)S'),
    ('[*:1][C@]1([*:2])CC1(Cl)Cl', '[C@]([*:1])1([*:2])CC1(Cl)Cl'),  # RDKit thinks these are different, and ChemAxon removes this stereochemistry entirely
    # example from Dalke report here: https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg05296.html
    ('CN[S@](c1c1)=O', 'CN[S@]2=O.c12c1'),
    # non-ring cases with no hydrogen property
    ('Cl[C@](F)(Br)(O)', '[C@](Cl)(F)(Br)(O)'),
    # OpenBabel and Indigo have issues with sulfur chirality
    ('Cl[S@](C)=O', '[S@](Cl)(C)=O'),  # ChemAxon flips this chirality
    ('Cl[S@](C)=', '[S@](Cl)(C)='),  # ChemAxon removes this chirality entirely
    # ring cases, RDKit thinks all of these are different