Re: [Rdkit-discuss] Canonicalisation with reaction labels
On Dec 16, 2016, at 3:27 PM, Andrew Dalke wrote:
> 2013 RDKit didn't preserve the atom order between labeled and unlabeled atoms.
 ...
> I no longer have an older version of RDKit installed.

My memory is wrong. I have rebuilt a version from 2013 and been unable to find a failure case.

That is, overnight I had it fragment every bond from about 450,000 structures and make either "[*:1].[*:1]" or "[*:1].[*:2]" pairs, then compared the canonical "de-labeled" SMILES (the canonical labeled SMILES with the :1/:2 removed) to the unlabeled canonical SMILES. (I also specified the isotope as 2*atomic number so there wouldn't be a problem with the brackets.)

In every case the unlabeled and de-labeled SMILES were identical. I tried some other variations but still found no mismatches.

> Going through my notes, here was one of the failure cases:
>
> core =>
> Cc1cc2c3c(c1)C[N@]([*])CCN(C)CC[N@@]([*])Cc1cc(C)cc(c1OCCCO3)C[N@@](C)CCN(C)CC[N@](C)C2
> syntax =>
> Cc1cc2c3c(c1)C[N@]([*:1])CCN(C)CC[N@@]([*:2])Cc1cc(C)cc(c1OCCCO3)C[N@](C)CCN(C)CC[N@@](C)C2
> canonical =>
> Cc1cc2c3c(c1)C[N@]([*:2])CCN(C)CC[N@@]([*:1])Cc1cc(C)cc(c1OCCCO3)C[N@@](C)CCN(C)CC[N@](C)C2

It appears this was from another problem. I wanted to fragment the structure and produce a canonical fragmented structure. The problem in the above is that the labels of "1" and "2" break the symmetry and lead to different canonical outputs. It is not related to the question you [Stephen] asked.

It may be that I did my analysis of canonical atom order in labeled/unlabeled SMILES with a newer version of the toolkit than 2013. In any case, I am surprised to find how stable those labels are in the 2013 release.

Cheers,

  Andrew
  da...@dalkescientific.com

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
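The "de-labeled" comparison described in this message can be sketched with plain string handling. This is a minimal illustration, not the script used for the 450,000-structure run: the helper names `delabel` and `find_mismatches` are hypothetical, and the canonical SMILES strings are assumed to have been generated beforehand by the toolkit.

```python
import re

# An atom-map suffix such as ":1" can only occur inside a bracket atom,
# directly before the closing "]", so this pattern never touches the
# aromatic-bond symbol ":" used outside brackets.
_ATOM_MAP = re.compile(r":\d+\]")


def delabel(smiles):
    """Remove atom-map labels by text substitution, e.g. "[*:1]" -> "[*]"."""
    return _ATOM_MAP.sub("]", smiles)


def find_mismatches(pairs):
    """Return the (labeled, unlabeled) pairs whose de-labeled form differs.

    `pairs` is an iterable of (canonical labeled SMILES, canonical unlabeled
    SMILES) tuples produced elsewhere; no chemistry is done here.
    """
    return [(lab, unlab) for lab, unlab in pairs if delabel(lab) != unlab]


# The two outputs Stephen reported for the same labeled input:
pairs = [
    ("[*:1]C1CCNCC1", "[*]C1CCNCC1"),    # 2013 install
    ("C1CC([*:1])CCN1", "[*]C1CCNCC1"),  # 2015-09 install
]
mismatches = find_mismatches(pairs)
```

With these two pairs, only the 2015-09 output is reported as a mismatch, which matches the behaviour change raised at the start of the thread.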
Re: [Rdkit-discuss] Canonicalisation with reaction labels
Interesting question. Not sure if it's relevant but in ECBlast we do provide Canonicalised reaction labels. I agree with Greg that AAM is important.

https://github.com/asad/ReactionDecoder
http://www.ebi.ac.uk/thornton-srv/software/rbl/

Regards,
Asad

> On 16 Dec 2016, at 14:42, Stephen Pickett <stephen.d.pick...@gsk.com> wrote:
>
> Thanks Greg, that’s clear.
>
> Stephen
Re: [Rdkit-discuss] Canonicalisation with reaction labels
Hi Stephen,

The new canonicalization algorithm intentionally takes the atom-mapping information into account. The logic is that the entire SMILES provided should be canonical, so if the SMILES includes atom maps, those atom maps should be considered while canonicalizing.

If you have a molecule with atom maps and you would like the canonical SMILES without the maps, you can do this (with the most recent version of the code):

In [18]: mol = Chem.MolFromSmiles('C1CC([*:1])CCN1')

In [19]: nmol = Chem.Mol(mol)

In [20]: for at in nmol.GetAtoms(): at.SetAtomMapNum(0)

In [21]: Chem.MolToSmiles(mol,True)
Out[21]: 'C1CC([*:1])CCN1'

In [22]: Chem.MolToSmiles(nmol,True)
Out[22]: '[*]C1CCNCC1'

A somewhat less clear (IMO) way of doing this that works in all versions is:

In [27]: nmol = Chem.Mol(mol)

In [28]: for at in nmol.GetAtoms(): at.ClearProp('molAtomMapNumber')

In [29]: Chem.MolToSmiles(nmol,True)
Out[29]: '[*]C1CCNCC1'

I hope this helps,
-greg

On Fri, Dec 16, 2016 at 1:55 PM, Stephen Pickett wrote:
> Hi
>
> With a 2013 RDKit install we get consistent canonicalization between
> reaction labelled and unlabelled atoms.
>
> >>> mol = Chem.MolFromSmiles('C1CC([*])CCN1')
> >>> Chem.MolToSmiles(mol)
> '[*]C1CCNCC1'
> >>> mol = Chem.MolFromSmiles('C1CC([*:1])CCN1')
> >>> Chem.MolToSmiles(mol)
> '[*:1]C1CCNCC1'
>
> In 2015-09 we are seeing differences.
>
> >>> mol = Chem.MolFromSmiles('C1CC([*])CCN1')
> >>> Chem.MolToSmiles(mol)
> '[*]C1CCNCC1'
> >>> mol = Chem.MolFromSmiles('C1CC([*:1])CCN1')
> >>> Chem.MolToSmiles(mol)
> 'C1CC([*:1])CCN1'
>
> I can understand why canonicalization can be different between versions
> but I’m not sure whether this change in behaviour is expected?
>
> I’m afraid that I don’t have ready access to a more recent install to test
> this out.
>
> Thanks
>
> Stephen
>
> --
>
> This e-mail was sent by GlaxoSmithKline Services Unlimited
> (registered in England and Wales No. 1047315), which is a
> member of the GlaxoSmithKline group of companies. The
> registered address of GlaxoSmithKline Services Unlimited
> is 980 Great West Road, Brentford, Middlesex TW8 9GS.
>
> GSK monitors email communications sent to and from GSK in order to
> protect GSK, our employees, customers, suppliers and business partners,
> from cyber threats and loss of GSK Information. GSK monitoring is conducted
> with appropriate confidentiality controls and in accordance with local laws
> and after appropriate consultation.
Re: [Rdkit-discuss] Canonicalisation with reaction labels
On Dec 16, 2016, at 1:55 PM, Stephen Pickett wrote:
> With a 2013 RDkit install we get consistent canonicalization between reaction
> labelled and unlabelled atoms.
> >>> mol = Chem.MolFromSmiles('C1CC([*])CCN1')
> >>> Chem.MolToSmiles(mol)
> '[*]C1CCNCC1'
> >>> mol = Chem.MolFromSmiles('C1CC([*:1])CCN1')
> >>> Chem.MolToSmiles(mol)
> '[*:1]C1CCNCC1'

2013 RDKit didn't preserve the atom order between labeled and unlabeled atoms. It looked like it did in many cases, but there were a few cases where the slight change to the initial atom invariants, caused by the atom label, ended up affecting the SMILES.

I no longer have an older version of RDKit installed. Going through my notes, here was one of the failure cases:

core =>
Cc1cc2c3c(c1)C[N@]([*])CCN(C)CC[N@@]([*])Cc1cc(C)cc(c1OCCCO3)C[N@@](C)CCN(C)CC[N@](C)C2
syntax =>
Cc1cc2c3c(c1)C[N@]([*:1])CCN(C)CC[N@@]([*:2])Cc1cc(C)cc(c1OCCCO3)C[N@](C)CCN(C)CC[N@@](C)C2
canonical =>
Cc1cc2c3c(c1)C[N@]([*:2])CCN(C)CC[N@@]([*:1])Cc1cc(C)cc(c1OCCCO3)C[N@@](C)CCN(C)CC[N@](C)C2

For my project I ended up canonicalizing with unlabeled atoms, using the _smilesAtomOutputOrder to figure out where the "*" atoms were located in the SMILES string, using CanonicalRankAtoms() to figure out which were symmetrical, and coming up with my own canonical labeling on top of the canonical unlabeled SMILES.

> I can understand why canonicalization can be different between versions but
> I’m not sure whether this change in behaviour is expected?

While it is possible to generate a canonical labeling which preserves the same atom order as the canonical unlabeled SMILES (as I did above), that's more complicated. It's easier to include the label as part of the atom invariant and use the regular canonicalization mechanism.

Cheers,

  Andrew
  da...@dalkescientific.com
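A much simplified, string-level sketch of the idea of labeling on top of the canonical unlabeled SMILES: it numbers the bracketed wildcard atoms in the order they appear in the string. This is not the procedure described in the message, which relied on _smilesAtomOutputOrder and CanonicalRankAtoms() to deal with symmetry. The function name `label_by_output_order` is hypothetical, and the sketch assumes that wildcards are written in brackets (as in the outputs quoted in this thread) and that labels only need to tell attachment points apart, not record which original label went where.

```python
import re
from itertools import count

# Matches a bracketed wildcard atom with or without an existing atom-map
# label, e.g. "[*]" or "[*:2]". Bare "*" atoms are not handled.
_WILDCARD = re.compile(r"\[\*(?::\d+)?\]")


def label_by_output_order(unlabeled_smiles):
    """Label wildcard atoms 1, 2, ... in order of appearance in the string.

    If the input is already a canonical unlabeled SMILES, the atom order is
    fixed, so the labels assigned here are reproducible for that string.
    """
    counter = count(1)
    return _WILDCARD.sub(lambda m: "[*:%d]" % next(counter), unlabeled_smiles)


labeled = label_by_output_order("[*]C1CCNCC1")
```

Because the labels are derived from the unlabeled string, they cannot perturb the atom order, which is the property the more elaborate approach is after.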