Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

2017-11-09 Thread Andrew Dalke
On Nov 9, 2017, at 21:49, Brian Cole  wrote:
> Certainly, but thousands of lines of Python doesn't fit in an email in an 
> easily digestible way. :-)

I'll restate things since I wasn't clear. While this step may be what you need 
for the way you structure things, there might be a better way to structure 
things. I can't tell because I don't know what it is you are trying to do.


> The reason I need to drop into a real RDKit molecule is because I want to be 
> able to attach to any implicit hydrogen for my application. I couldn't think 
> of an easy regular expression that located an atom block with one or more 
> implicit hydrogens.

There isn't one. That requires at least a context-free grammar because it needs 
to count the valence used by branches, and branches can be arbitrarily nested.
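
To make the nesting point concrete, here is a minimal sketch that counts the explicit connections on each atom term. It needs a stack for the branches and a table for the open ring closures, which is exactly what a plain regular expression can't provide. The names TOKEN and connection_counts are made up for this illustration; it counts connections only (not bond orders or valence) and does no validation.

```python
import re

# One token per match: atom term, ring closure, branch open/close, dot, bond.
TOKEN = re.compile(r"""
(?P<atom>Cl?|Br?|[NOSPFIbcnosp*]|\[[^]]*\]) |
(?P<closure>%\d\d|\d) |
(?P<open>\() |
(?P<close>\)) |
(?P<dot>\.) |
(?P<bond>[-=\#$:/\\])
""", re.X)

def connection_counts(smiles):
    """Return the number of explicit connections for each atom term, in order."""
    counts = []          # counts[i] is the connection count of the i-th atom term
    prev = None          # index of the atom the next atom/closure attaches to
    stack = []           # saved 'prev' values, one per open branch
    open_closures = {}   # ring closure number -> atom index that opened it
    pos = 0
    while pos < len(smiles):
        m = TOKEN.match(smiles, pos)
        if m is None:
            raise ValueError("unexpected character at offset %d" % pos)
        pos = m.end()
        kind = m.lastgroup
        if kind == "atom":
            counts.append(0)
            idx = len(counts) - 1
            if prev is not None:
                counts[prev] += 1
                counts[idx] += 1
            prev = idx
        elif kind == "closure":
            label = int(m.group().lstrip("%"))
            if label in open_closures:
                other = open_closures.pop(label)
                counts[other] += 1
                counts[prev] += 1
            else:
                open_closures[label] = prev
        elif kind == "open":
            stack.append(prev)
        elif kind == "close":
            prev = stack.pop()
        elif kind == "dot":
            prev = None
        # bond symbols are skipped: this sketch ignores bond order
    return counts

print(connection_counts("CC(C)(C)O"))           # [1, 4, 1, 1, 1]
print(connection_counts("[C@]1(C)CC[SiH2]O1"))  # [3, 1, 2, 2, 2, 2]
```

The silicon in the second example has two connections plus its stated H2, which is where the valence of 4 comes from.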

I think your "any implicit hydrogen" will have problems when the implicit 
hydrogen count is specified in square brackets, as with a chiral hydrogen, or 
an atom outside of the organic subset, or one with another property specified 
(e.g., isotopes or charge).

Leaving the tricky chiral hydrogen aside, you're turning:

  [C@]([*:9])1(C)C[SiH2]O1

where the silicon has an implicit hydrogen count of 2 and a valence of 4, into 

  C[C@]19C[SiH2]8O1

where the silicon is now 5-valent. Similarly,

  [B-]1OCC[NH+]1

becomes

  [B-]1OCC[NH+]18

If you have some way to annotate which atoms have at least one implicit 
hydrogen then you can use the regular expression from my last email, and if it 
uses []s then reach in and reduce the H count by 1 as part of the 
transformation.

You'll still need some special code to deal with chiral hydrogens.
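
As a rough sketch of the "reach in and reduce the H count by 1" step, something like the following could work on a single bracket atom term. BRACKET_ATOM and decrement_bracket_h are hypothetical names, the pattern only covers the common isotope/symbol/@/H/charge layout (no chiral classes like @TH1), and it deliberately refuses chiral atoms with a hydrogen since those need the special handling mentioned above.

```python
import re

# Simplified layout of a bracket atom: [isotope symbol chirality Hcount rest]
BRACKET_ATOM = re.compile(r"""
\[
(?P<isotope>\d*)
(?P<symbol>[A-Z][a-z]?|[a-z]{1,2}|\*)
(?P<chiral>@{0,2})
(?P<hcount>H\d*)?
(?P<rest>[^\]]*)
\]
""", re.X)

def decrement_bracket_h(term):
    """Return the bracket atom term with its stated hydrogen count reduced by 1."""
    m = BRACKET_ATOM.fullmatch(term)
    if m is None:
        raise ValueError("not a bracket atom term: %r" % (term,))
    if m.group("hcount") is None:
        raise ValueError("no hydrogen count stated in %r" % (term,))
    if m.group("chiral"):
        raise ValueError("chiral hydrogen in %r needs special handling" % (term,))
    digits = m.group("hcount")[1:]
    n = int(digits) if digits else 1   # a bare 'H' means one hydrogen
    if n == 0:
        raise ValueError("hydrogen count is already zero in %r" % (term,))
    if n == 1:
        new_h = ""
    elif n == 2:
        new_h = "H"
    else:
        new_h = "H%d" % (n - 1)
    return "[%s%s%s%s%s]" % (m.group("isotope"), m.group("symbol"),
                             m.group("chiral"), new_h, m.group("rest"))

print(decrement_bracket_h("[SiH2]"))  # [SiH]
print(decrement_bracket_h("[NH+]"))   # [N+]
```

With that, the silicon example would come out as [SiH]8 rather than the 5-valent [SiH2]8.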

BTW, I don't think you need closures for this at all. You have a set of 
fragments, where you know which atom will be attached, and I believe you 
control the ordering of the atoms in that fragment.

If you use MolToSmiles(..., rootedAtAtom=idx) so that the attachment atom is
always first, e.g. place the "O" in the phenol first:

  Oc2ccccc2

(numbered so that its ring closure does not collide with the core's open
closure 1), then you can attach it to the core at a given point using a
branch, e.g., to attach it to the 4th atom of:

  [C@]1(C)CCO1 -> [C@]1(C)CC(Oc2ccccc2)O1

This is the core up through the 4th regular expression match (to find the
atom, and skip any ring closures), followed by '(' followed by the rooted
fragment followed by ')' followed by the rest of the original string. The
regular expression match itself also needs modifying to reduce the H count,
if the implicit H-count is stated explicitly. All of this assumes the core
attachment point has at least one implicit, non-chiral hydrogen on it.

It feels slightly less tricky than the ring closure solution, though still 
tricky.
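
A minimal sketch of that branch-based attachment, under a few assumptions: attach_as_branch and closure_pattern are names invented here, the fragment's ring closure digits are assumed not to collide with closures still open in the core, and the H-count adjustment and chiral cases are left out.

```python
import re

# Atom term pattern from earlier in this thread
atom_pattern = re.compile(r"""
(
 Cl? |
 Br? |
 [NOSPFIbcnosp*] |
 \[[^]]*\]
)
""", re.X)

# Zero or more ring closures (optionally with a bond symbol) after an atom
closure_pattern = re.compile(r"(?:[-=#$:/\\]?(?:\d|%\d\d))*")

def attach_as_branch(core, atom_index, rooted_fragment):
    """Insert '(' + rooted_fragment + ')' after the given atom term of core.

    atom_index is 0-based and counts atom terms in the SMILES string,
    which is not necessarily the same as a toolkit's atom index.
    """
    matches = list(atom_pattern.finditer(core))
    m = matches[atom_index]
    # Skip any ring closures that belong to the matched atom
    end = closure_pattern.match(core, m.end()).end()
    return core[:end] + "(" + rooted_fragment + ")" + core[end:]

print(attach_as_branch("[C@]1(C)CCO1", 3, "Oc2ccccc2"))
# [C@]1(C)CC(Oc2ccccc2)O1
```

Note that inserting a branch after a chiral atom's ring closures changes the order of its neighbors, so this sketch is only safe as written for non-chiral attachment atoms.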


Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

2017-11-09 Thread Brian Cole
>
> Somehow you got the code to generate a "9" for that ring closure, which is
> not something that RDKit does naturally, so we are only seeing a step in
> the larger part of your goal.
>

Certainly, but thousands of lines of Python doesn't fit in an email in an
easily digestible way. :-)


> Since you are already comfortable manipulating the SMILES string directly,
> a faster solution is to bypass the toolkit and manipulate the SMILES
> directly, as in:
>
> 
> import re
>
> # Match the SMILES for an atom, followed by its closures
> atom_pattern = re.compile(r"""
> (
>  Cl? | # Cl and Br are part of the organic subset
>  Br? |
>  [NOSPFIbcnosp*] |  # as are these single-letter elements
>  \[[^]]*\] # everything else must be in []s
> )
> """, re.X)
>
> smiles = 'F9.[C@]91(C)CCO1'
> fluorine, core = smiles.split('.')
> matches = list(atom_pattern.finditer(core))
> m = matches[3]
> new_core = core[:m.end()] + "8" + core[m.end():]
> print(new_core)
> 
>

The reason I need to drop into a real RDKit molecule is because I want to
be able to attach to any implicit hydrogen for my application. I couldn't
think of an easy regular expression that located an atom block with one or
more implicit hydrogens. So I drop into an RDKit molecule for that part to
figure out where are possible hydrogens for me to replace with a functional
group.


> Also, this:
>
>   >>> mol.AddBond(idx, 4, Chem.rdchem.BondType.SINGLE)
>
> is a piece of magic. Where does the 4 come from? RDKit doesn't guarantee
> that the nth atom term in the input SMILES is the same as the nth
> identifier. It's close, but, for example, explicit '[H]' atoms are usually
> turned into implicit hydrogen counts.
>

Hence the reason I use this to actually parse the SMILES:

def MolFromSmilesWithHydrogen(smiles):
    params = Chem.rdmolfiles.SmilesParserParams()
    params.removeHs = False
    return Chem.MolFromSmiles(smiles, params)

Even so, in the actual application the atom indices do refer to an actual
RDKit molecule that has been scanned for implicit hydrogen locations. Was
just trying to keep it 'email simple'.


> > I've written code in the past to do this kind of thing for virtual
> > library building, using dummy atoms to mark link positions in the
> > fragments, and using Perl code to transform between the dummy atoms
> > and bond-closure numbers to give text strings which could be assembled
> > to give valid dot-disconnected SMILES. This required additional
> > lexical transformations in order to maintain valid SMILES depending on
> > where the dummy atom was, and to make sure that stereochemistry worked
> > properly. If you want to do this kind of thing I don't think you can
> > expect to avoid these additional lexical operations.
>
> This is exactly what mmpdb does, although in Python code. If anyone is
> interested, see https://github.com/rdkit/mmpdb/blob/master/mmpdblib/
> smiles_syntax.py .
>

And I've totally stolen your idea and run with it over the past year or so.
:-)

Hoping I can talk about it and maybe even open-source it sometime. I'd like
to hook it up to mmpdb as well, if I can.

Cheers,
Brian


Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

2017-11-09 Thread Andrew Dalke
On Nov 9, 2017, at 16:09, Brian Cole  wrote:
> Here's an example of why this is useful for maintaining molecular 
> fragmentation inside your molecular representation:
> 
>  >>> from rdkit import Chem
>  >>> smiles = 'F9.[C@]91(C)CCO1'
>  >>> fluorine, core = smiles.split('.')
>  >>> fluorine
>  'F9'
>  >>> fragment = core.replace('9', '([*:9])')

Somehow you got the code to generate a "9" for that ring closure, which is not 
something that RDKit does naturally, so we are only seeing a step in the larger 
part of your goal.

The step you gave does a number of transformations to convert:

  [C@]91(C)CCO1

so the 4th atom has an '8' as an attachment point, that is:

  [C@]91(C)CC8O1

Since you are already comfortable manipulating the SMILES string directly, a 
faster solution is to bypass the toolkit and manipulate the SMILES directly, as 
in:


import re

# Match the SMILES term for a single atom (its ring closures are not included)
atom_pattern = re.compile(r"""
(
 Cl? | # Cl and Br are part of the organic subset
 Br? |
 [NOSPFIbcnosp*] |  # as are these single-letter elements
 \[[^]]*\] # everything else must be in []s
)
""", re.X)

smiles = 'F9.[C@]91(C)CCO1'
fluorine, core = smiles.split('.')
matches = list(atom_pattern.finditer(core))
m = matches[3]
new_core = core[:m.end()] + "8" + core[m.end():]
print(new_core)


Also, this:

  >>> mol.AddBond(idx, 4, Chem.rdchem.BondType.SINGLE)

is a piece of magic. Where does the 4 come from? RDKit doesn't guarantee that 
the nth atom term in the input SMILES is the same as the nth identifier. It's 
close, but, for example, explicit '[H]' atoms are usually turned into implicit 
hydrogen counts.

Finally, there's another assumption in:
  >>> new_core = new_core.replace('([*:9])', '9').replace('([*:8])', '8')

Sometimes the result will not be inside of ()s. For example, the same 
transformation on:

  F9.[C@]91(C)C(C)O1

produces a new_core of:

  C[C@@]19OC1C[*:8]

when you want it to produce:

  C[C@@]19OC1C8

For what it's worth, the re-based version generates:

  [C@]91(C)C(C8)O1
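
If you do stay with the dummy-atom route, the pair of .replace() calls can be made insensitive to whether the dummy ended up inside ()s. This is only a sketch: dummy_to_closure is a made-up helper, it assumes the dummy is written exactly as [*:N], and it does not handle a dummy at the very start of the SMILES or one with a bond symbol in front of it. It also does nothing about the change in neighbor order around a chiral atom, which is the stereo issue this thread is about.

```python
import re

def dummy_to_closure(smiles, map_number):
    """Replace the dummy atom [*:map_number] with a ring closure number.

    Handles both the branch form '([*:N])' and the bare form '[*:N]'.
    """
    if not 1 <= map_number <= 99:
        raise ValueError("map number must be between 1 and 99")
    if map_number < 10:
        closure = str(map_number)
    else:
        closure = "%%%02d" % map_number   # two-digit closures are written %NN
    pattern = re.compile(r"\(\[\*:%d\]\)|\[\*:%d\]" % (map_number, map_number))
    return pattern.sub(closure, smiles)

print(dummy_to_closure("C[C@@]19OC1C[*:8]", 8))   # C[C@@]19OC1C8
```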


On Nov 9, 2017, at 16:27, Chris Earnshaw  wrote:
> Trouble is, you're mixing chemical operations and lexical ones.

Agreed.

> I've written code in the past to do this kind of thing for virtual
> library building, using dummy atoms to mark link positions in the
> fragments, and using Perl code to transform between the dummy atoms
> and bond-closure numbers to give text strings which could be assembled
> to give valid dot-disconnected SMILES. This required additional
> lexical transformations in order to maintain valid SMILES depending on
> where the dummy atom was, and to make sure that stereochemistry worked
> properly. If you want to do this kind of thing I don't think you can
> expect to avoid these additional lexical operations.

This is exactly what mmpdb does, although in Python code. If anyone is 
interested, see 
https://github.com/rdkit/mmpdb/blob/master/mmpdblib/smiles_syntax.py .

Cheers,


Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

2017-11-09 Thread Chris Earnshaw
Trouble is, you're mixing chemical operations and lexical ones. It
might be handy if this 'just worked' but in practice it's not going to
produce valid SMILES without more work.

I've written code in the past to do this kind of thing for virtual
library building, using dummy atoms to mark link positions in the
fragments, and using Perl code to transform between the dummy atoms
and bond-closure numbers to give text strings which could be assembled
to give valid dot-disconnected SMILES. This required additional
lexical transformations in order to maintain valid SMILES depending on
where the dummy atom was, and to make sure that stereochemistry worked
properly. If you want to do this kind of thing I don't think you can
expect to avoid these additional lexical operations.

I don't think it's reasonable to expect that invalid SMILES strings
should be coerced into giving a particular result for convenience when
1) - they're invalid! and 2) - the behaviour is actually a reasonable
interpretation of the order of connections in the SMILES (even though
they are invalid).

I don't think the current RDKit interpretation of these SMILES should
change, though it might be useful if it could issue a warning that
SMILES of this type are not correct.
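
Until a toolkit issues such a warning, a purely lexical check can flag the pattern. The sketch below is an assumption-laden illustration, not anything RDKit provides: closures_after_branch is a hypothetical name, and it relies on the fact that parentheses cannot appear inside a bracket atom, so a digit (or %NN) directly after ')' can only be a ring closure that does not immediately follow its atom.

```python
import re

# A ring closure (optionally preceded by a bond symbol) right after a ')'
_CLOSURE_AFTER_BRANCH = re.compile(r"\)[-=#$:/\\]?(%\d\d|\d)")

def closures_after_branch(smiles):
    """Return (offset, closure) pairs where a ring closure follows a branch."""
    return [(m.start(), m.group(1))
            for m in _CLOSURE_AFTER_BRANCH.finditer(smiles)]

for smi in ("F[C@@]1(C)CCO1", "[C@@](F)1(C)CCO1", "F[C@@](C)1CCO1"):
    hits = closures_after_branch(smi)
    if hits:
        print("warning: non-strict ring closure placement in", smi, hits)
```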

Best regards,
Chris




Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

2017-11-09 Thread Brian Cole
Here's an example of why this is useful for maintaining molecular
fragmentation inside your molecular representation:

>>> from rdkit import Chem
>>> smiles = 'F9.[C@]91(C)CCO1'
>>> fluorine, core = smiles.split('.')
>>> fluorine
'F9'
>>> fragment = core.replace('9', '([*:9])')
>>> fragment
'[C@]([*:9])1(C)CCO1'
>>> # RDKit is flipping the stereo on me here even though the order of the
>>> # bonds has not changed
>>> mol = Chem.RWMol(Chem.MolFromSmiles(fragment))
>>> idx = mol.AddAtom(Chem.Atom(0))
>>> mol.AddBond(idx, 4, Chem.rdchem.BondType.SINGLE)
7
>>> mol.GetAtomWithIdx(idx).SetIntProp("molAtomMapNumber", 8)
>>> new_core = Chem.MolToSmiles(mol, True)
>>> new_core = new_core.replace('([*:9])', '9').replace('([*:8])', '8')
>>> new_core
'C[C@]19CC8O1'
>>> analog_smiles = 'Cl8.' + fluorine + '.' + new_core
>>> analog_smiles
'Cl8.F9.C[C@]19CC8O1'
>>> analog = Chem.MolFromSmiles(analog_smiles)
>>> # Uh oh! My original molecule didn't match
>>> analog.HasSubstructMatch(Chem.MolFromSmiles(smiles), useChirality=True)
False
>>> # flipping the stereo of the original causes it to match again
>>> analog.HasSubstructMatch(Chem.MolFromSmiles(smiles.replace('@', '@@')),
...                          useChirality=True)
True






Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

2017-11-09 Thread Andrew Dalke
On Nov 9, 2017, at 08:13, Greg Landrum  wrote:
> As was discussed in the comments of 
> https://github.com/rdkit/rdkit/issues/786, I think it's pretty gross that the 
> second syntax is even legal. But that's a side point.

To belabor that point: neither Daylight SMILES nor OpenSMILES accepts it, and 
those are the only two explicit sources of "legal" that people use.

"allowed" might be a better term.

Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

2017-11-08 Thread Chris Earnshaw
Hi

Surely the problem is that some of these SMILES aren't really valid. From
the Daylight theory manual: '*The bonds are numbered in any order,
designating ring opening (or ring closure) bonds by a digit immediately
following the atomic symbol at each ring closure'*  (my emphasis).

So the behaviour with SMILES where there is an atom between the ring
closure digit and the atom to which the ring closure applies (e.g.
[C@@](F)1(C)CCO1)
may well not be well defined. Arguably RDKit should refuse to process
these, but apparently it looks at the atom order and inverts the
stereochemistry instead. In Daylight SMILES the @ symbol refers to the
order of substituents around the asymmetric atom. If we swap the ring
closure digit and one of the atoms then we've changed the order of
connections and inverted the stereochemistry, so the current behaviour
seems reasonable. Personally I wouldn't change the behaviour, except perhaps
to get RDKit to issue a warning that the SMILES isn't 'strict' in these cases.
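
The "order of substituents" rule can be sketched as a permutation parity check: an odd permutation of the neighbor order flips @ and @@, an even one leaves it alone. The helpers below (permutation_parity, same_chirality) are illustrative names only, the neighbor lists are typed in by hand rather than parsed, and each neighbor label must be unique. The two pairs used are the OpenSMILES spec examples listed in the test script from this thread.

```python
def permutation_parity(order_a, order_b):
    """Return 0 if order_b is an even permutation of order_a, 1 if odd."""
    position = {label: i for i, label in enumerate(order_a)}
    perm = [position[label] for label in order_b]
    # Parity of a permutation equals the parity of its inversion count
    inversions = sum(1
                     for i in range(len(perm))
                     for j in range(i + 1, len(perm))
                     if perm[i] > perm[j])
    return inversions % 2

def same_chirality(tag_a, order_a, tag_b, order_b):
    """True if the two (tag, neighbor order) descriptions agree."""
    flipped = permutation_parity(order_a, order_b) == 1
    return (tag_a == tag_b) != flipped

# N[C@@](Br)(C)O  vs  N[C@](Br)(O)C : one swap, and the tag changes too
print(same_chirality("@@", ["N", "Br", "C", "O"],
                     "@",  ["N", "Br", "O", "C"]))   # True

# Br[C@](N)(C)O  vs  O[C@](Br)(C)N : even permutation, same tag
print(same_chirality("@", ["Br", "N", "C", "O"],
                     "@", ["O", "Br", "C", "N"]))    # True
```

The open question in this thread is only which position a ring closure digit occupies in that neighbor list when it is written after a branch; the parity rule itself is not in dispute.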

I think the safest approach is to stick to SMILES which are unequivocally
valid, unless RDKit is going to create its own definition of SMILES...


Best regards,
Chris Earnshaw


Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

2017-11-08 Thread Greg Landrum
On Thu, Nov 9, 2017 at 6:32 AM, Brian Cole  wrote:

> Hi Cheminformaticians,
>
> This is an extreme subtlety in the interpretation of SMILES atom
> stereochemistry and I think a bug in RDKit. Specifically, I think the
> following SMILES should be the same molecule:
>
> >>> rdkit.__version__
> '2017.09.1'
> >>> Chem.CanonSmiles('F[C@@]1(C)CCO1')
> 'C[C@]1(F)CCO1'
> >>> Chem.CanonSmiles('[C@@](F)1(C)CCO1')
> 'C[C@@]1(F)CCO1'
>

As was discussed in the comments of
https://github.com/rdkit/rdkit/issues/786, I think it's pretty gross that
the second syntax is even legal. But that's a side point.

> Since there is no hydrogen inside the stereo carbon atom block the bond
> being 'looked down' should be the first atom encountered. In both cases
> above, that should be the fluorine, therefore the molecules should be
> equivalent.
>

Agreed, and this is a view that's further supported by this behavior:

In [2]: Chem.CanonSmiles('F[C@@]1(C)CCO1')
Out[2]: 'C[C@]1(F)CCO1'

In [3]: Chem.CanonSmiles('F[C@@](C)1CCO1')
Out[3]: 'C[C@@]1(F)CCO1'

Would you mind filing a bug for this and I'll try to track it down/fix it?

Thanks,
-greg





[Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

2017-11-08 Thread Brian Cole
Hi Cheminformaticians,

This is an extreme subtlety in the interpretation of SMILES atom
stereochemistry and I think a bug in RDKit. Specifically, I think the
following SMILES should be the same molecule:

>>> rdkit.__version__
'2017.09.1'
>>> Chem.CanonSmiles('F[C@@]1(C)CCO1')
'C[C@]1(F)CCO1'
>>> Chem.CanonSmiles('[C@@](F)1(C)CCO1')
'C[C@@]1(F)CCO1'

Since there is no hydrogen inside the stereo carbon atom block, the bond
being 'looked down' should be the one to the first atom encountered. In both
cases above, that should be the fluorine, therefore the molecules should be
equivalent.

Though it could be argued the 2nd one is not strict SMILES as Andrew
describes here: https://github.com/rdkit/rdkit/issues/786

It is useful when recombining fragments with ring closure digits for these
to be equivalent:
[*][C@]1(C)CCO1
[C@]([*])1(C)CCO1

Also, every other tool I can get my hands on agrees they're the same:
OEChem, OpenBabel, indigo, and ChemAxon. (CDK lacks a simple enough
canonicalization example for me to work from.)

Sure wish there was a SMILES validation test suite we could all run
against. And so I'm attaching the examples I used to verify the above so
whatever poor soul is assigned that task later can find this on Google. (I'm
hopeful :-)

Thanks,
Brian

PS: the current output from the script:

$ python stereo_handling_first_atom.py
RDKit = 2017.09.1
OEChem = 2.1.2
OpenBabel = 2.4.1
indigo = 1.2.3.r0-g98188eb mac10.7
RDKit failed to recognize these as the same:
[*:1][C@]1([*:2])CC1(Cl)Cl -> ClC1(Cl)C[C@]1([*:1])[*:2]
[C@]([*:1])1([*:2])CC1(Cl)Cl -> ClC1(Cl)C[C@@]1([*:1])[*:2]
OpenBabel failed to recognize these as the same:
Cl[S@](C)=O -> C[S@](=O)Cl
[S@](Cl)(C)=O -> C[S@@](=O)Cl
Indigo failed to recognize these as the same:
Cl[S@](C)=O -> C[S@](=O)Cl
[S@](Cl)(C)=O -> C[S@@](=O)Cl
OpenBabel failed to recognize these as the same:
Cl[S@](C)= -> =[S@](Cl)C
[S@](Cl)(C)= -> =[S@@](Cl)C
Indigo failed to recognize these as the same:
Cl[S@](C)= -> =[S@@](C)Cl
[S@](Cl)(C)= -> =[S@](C)Cl
RDKit failed to recognize these as the same:
Cl[C@](F)1CC[C@H](F)CC1 -> F[C@H]1CC[C@](F)(Cl)CC1
[C@](Cl)(F)1CC[C@H](F)CC1 -> F[C@H]1CC[C@@](F)(Cl)CC1
RDKit failed to recognize these as the same:
Cl[C@]1(c2c2)NCCCS1 -> Cl[C@]1(c2c2)NCCCS1
[C@](Cl)1(c2c2)NCCCS1 -> Cl[C@@]1(c2c2)NCCCS1
RDKit failed to recognize these as the same:
Cl3.[C@]31(c2c2)NCCCS1 -> Cl[C@]1(c2c2)NCCCS1
[C@](Cl)1(c2c2)NCCCS1 -> Cl[C@@]1(c2c2)NCCCS1
RDKit failed to recognize these as the same:
Cl[C@](F)1C2C(C1)CNC2 -> F[C@@]1(Cl)CC2CNCC21
[C@](Cl)(F)1C2C(C1)CNC2 -> F[C@]1(Cl)CC2CNCC21
RDKit failed to recognize these as the same:
[*][C@@H]1CO1 -> [*][C@@H]1CO1
[C@H]([*])1CO1 -> [*][C@H]1CO1
RDKit failed to recognize these as the same:
[*][C@@]1(C)CCO1 -> [*][C@@]1(C)CCO1
[C@@]([*])1(C)CCO1 -> [*][C@]1(C)CCO1
RDKit failed to recognize these as the same:
F[C@@]1(C)CCO1 -> C[C@]1(F)CCO1
[C@@](F)1(C)CCO1 -> C[C@@]1(F)CCO1
RDKit failed to recognize these as the same:
Cl[C@@H]1[C@@H](Cl)C(Cl)CCN1 -> ClC1CCN[C@H](Cl)[C@H]1Cl
[C@H](Cl)1[C@@H](Cl)C(Cl)CCN1 -> ClC1CCN[C@@H](Cl)[C@H]1Cl

Attachment: stereo_handling_first_atom.py

from __future__ import print_function
import sys
sys.path.append('/Users/coleb/indigo-python')

import rdkit
from rdkit import Chem

from openeye.oechem import *

import pybel
import openbabel

import indigo

print("RDKit =", rdkit.__version__)
print("OEChem =", OEChemGetRelease())
print("OpenBabel =", openbabel.OBReleaseVersion())
print("indigo =", indigo.Indigo().version())

def indigoCanSmi(smi):
    return indigo.Indigo().loadMolecule(smi).canonicalSmiles()

def OBCanSmi(smi):
    return pybel.readstring("smi", smi).write("can").strip()

def OECanSmi(smi):
    mol = OEGraphMol()
    OESmilesToMol(mol, smi)
    return OEMolToSmiles(mol)

SMILES_THAT_ARE_THE_SAME = [
# examples from OpenSMILES spec, every toolkit agrees on these at least
('N[C@H](O)C',   '[C@@H](N)(O)C'),
('N[C@@](Br)(C)O',   'N[C@](Br)(O)C'),
('Br[C@](N)(C)O','O[C@](Br)(C)N'),

# examples with attachment points every toolkit agrees on
('[*][C@@H](C)N','[C@H]([*])(C)N'),
('[*][C@@](F)(C)N',  '[C@@]([*])(F)(C)N'),

# examples from Dalke 2017 UGM talk
('[*][C@](N)(O)S', '[C@]([*])(N)(O)S'),
('[*][C@H](O)S',   '[C@@H]([*])(O)S'),
('[*:1][C@]1([*:2])CC1(Cl)Cl', '[C@]([*:1])1([*:2])CC1(Cl)Cl'), # RDKit thinks these are different, and ChemAxon removes this stereochemistry entirely

# example from Dalke report here: https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg05296.html
('CN[S@](c1c1)=O', 'CN[S@]2=O.c12c1'),

# non-ring cases with no hydrogen property
('Cl[C@](F)(Br)(O)', '[C@](Cl)(F)(Br)(O)'),

# OpenBabel and Indigo have issues with sulfur chirality
('Cl[S@](C)=O',  '[S@](Cl)(C)=O'),# ChemAxon flips this chirality
('Cl[S@](C)=',   '[S@](Cl)(C)='), # ChemAxon removes this chirality entirely

# ring cases, RDKit thinks all of these are different