Re: [Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens
Thank you both Greg & Markus – I`ll happily wait for it to appear in conda in the near future ☺ Alex From: Markus Sitzmann [mailto:markus.sitzm...@gmail.com] Sent: Wednesday, November 23, 2016 3:40 PM To: Alexander Klenner-Bajaja Cc: rdkit-discuss@lists.sourceforge.net Subject: Re: [Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens If I understood Greg correctly, it will be in 2016.09 which isn't in conda just of yet, they are currently working on putting it there. Markus - | Markus Sitzmann | markus.sitzm...@gmail.com<mailto:markus.sitzm...@gmail.com> On 23 Nov 2016, at 15:29, Alexander Klenner-Bajaja <aklen...@epo.org<mailto:aklen...@epo.org>> wrote: Dear Greg, Thank you very much, looking at the results that function was exactly what I was looking for – only I can’t find it in my updated anaconda installation. “conda update rdkit” tells me I have the latest version 2016.03.4 and postgres tells me I have the 3.4 version of the RDKit extension If I understand your blog post correctly it should be in 2016.03 version? What am I missing? Best, Alex From: Greg Landrum [mailto:greg.land...@gmail.com] Sent: Wednesday, November 23, 2016 11:42 AM To: Alexander Klenner-Bajaja Cc: rdkit-discuss@lists.sourceforge.net<mailto:rdkit-discuss@lists.sourceforge.net> Subject: Re: [Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens Hi Alex, The new version of the cartridge has some capabilities that, I think, address this. There's a blog post about this: http://rdkit.blogspot.com/2016/07/tuning-substructure-queries-ii.html but the short version is that you can do the kind of queries it seems like you want to do quite simply: chembl_21=# select * from rdk.mols where m@>mol_adjust_query_properties('*c1ncccn1') limit 3; molregno | m --+--- 601707 | CCCc1nc(-c2ccc(F)cc2)oc1C(=O)NC(CC)CN1CCN(c2ncccn2)CC1 289103 | CC1C(=N)/C(=N/Nc2ccc(S(=O)(=O)Nc3ncccn3)cc2)C(=O)C(C)C1=O 607646 | CCNC(=O)[C@@H]1OC(n2cnc3c(NC(=O)Nc4ccc(S(=O)(=O)Nc5ncccn5)cc4)ncnc32)[C@@H](O)[C@H]1O (3 rows) chembl_21=# select * from rdk.mols where m@>mol_adjust_query_properties('*c1nc(*)ccn1') limit 3; molregno | m --+--- 158659 | CCNc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(C)CC2)n1 158743 | Nc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(Cc3c3)CC2)n1 158843 | CC1(C)CC(n2cnc(-c3ccc(F)cc3)c2-c2ccnc(N)n2)CC(C)(C)N1 (3 rows) chembl_21=# select * from rdk.mols where m@>mol_adjust_query_properties('*c1nc(*)cc(*)n1') limit 3; molregno |m --+-- 726443 | CN=C(S)NNc1nc(C)cc(C)n1 561136 | C[C@H](Nc1cc(NC2CC2)nc(C(F)(F)F)n1)[C@@H](Cc1ccc(Cl)cc1)c1(Br)c1 205784 | CCN(CC)C(=O)CSc1nc(N)cc(Cl)n1 (3 rows) There's more detail in the blog post, but the default behavior is to convert dummies into generic query atoms and to constrain the substitution at any other *ring* position. Best Regards, -greg On Wed, Nov 23, 2016 at 9:20 AM, Alexander Klenner-Bajaja <aklen...@epo.org<mailto:aklen...@epo.org>> wrote: Hi all, I am currently exploring the possibilities of the RDKit database cartridge for substructure search- I installed everything following the tutorial from http://www.rdkit.org/docs/Install.html Very nice tutorial - worked perfectly fine. Since we are exploring solutions for browser based gui searches I created a test page using Ketcher (http://lifescience.opensource.epam.com/ketcher/) which communicates with the database through PHP. Ketcher returns a SMILES representation from the drawn molecule. The raw data of the molecules in the database are canonical SMILES created from RDKIT canonical SMILES from the rdkit KNIME node (they are text-mined from patents). When doing substructure searches, as long as we query for well-defined compounds the results make sense – however looking at R1,…-groups things get a little odd. I found a very old discussion on the mailing list from 2009 where this has been discussed and I understood from that dialog that when looking at SMILES with a “*” representation this is interpreted as a dummy atom and the same dummy atom is expected in the search space to produce a hit. While a SMARTS representation of the same string actually leads to the behaviour that “any atom” is matched at that position. I ended up with the very cumbersome query, I am sure there are more elegant ways of doing this using ::qmol notation, but as I said I am currently exploring ☺ That’s the query (in PHP) in question for PostgreSQL: $search_result = pg_query($dbconn, "select m from pat.mols where m@>mol_from_smarts
Re: [Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens
If I understood Greg correctly, it will be in 2016.09 which isn't in conda just of yet, they are currently working on putting it there. Markus - | Markus Sitzmann | markus.sitzm...@gmail.com > On 23 Nov 2016, at 15:29, Alexander Klenner-Bajaja <aklen...@epo.org> wrote: > > Dear Greg, > > Thank you very much, looking at the results that function was exactly what I > was looking for – only I can’t find it in my updated anaconda installation. > > “conda update rdkit” tells me I have the latest version 2016.03.4 and > postgres tells me I have the 3.4 version of the RDKit extension > > If I understand your blog post correctly it should be in 2016.03 version? > What am I missing? > > > Best, > > Alex > > > > From: Greg Landrum [mailto:greg.land...@gmail.com] > Sent: Wednesday, November 23, 2016 11:42 AM > To: Alexander Klenner-Bajaja > Cc: rdkit-discuss@lists.sourceforge.net > Subject: Re: [Rdkit-discuss] smarts vs smiles database queries and explicit > hydrogens > > Hi Alex, > > The new version of the cartridge has some capabilities that, I think, address > this. > > There's a blog post about this: > http://rdkit.blogspot.com/2016/07/tuning-substructure-queries-ii.html > but the short version is that you can do the kind of queries it seems like > you want to do quite simply: > > chembl_21=# select * from rdk.mols where > m@>mol_adjust_query_properties('*c1ncccn1') limit 3; > molregno | m > > --+--- >601707 | CCCc1nc(-c2ccc(F)cc2)oc1C(=O)NC(CC)CN1CCN(c2ncccn2)CC1 >289103 | CC1C(=N)/C(=N/Nc2ccc(S(=O)(=O)Nc3ncccn3)cc2)C(=O)C(C)C1=O >607646 | > CCNC(=O)[C@@H]1OC(n2cnc3c(NC(=O)Nc4ccc(S(=O)(=O)Nc5ncccn5)cc4)ncnc32)[C@@H](O)[C@H]1O > (3 rows) > > chembl_21=# select * from rdk.mols where > m@>mol_adjust_query_properties('*c1nc(*)ccn1') limit 3; > molregno | m > --+--- >158659 | CCNc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(C)CC2)n1 >158743 | Nc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(Cc3c3)CC2)n1 >158843 | CC1(C)CC(n2cnc(-c3ccc(F)cc3)c2-c2ccnc(N)n2)CC(C)(C)N1 > (3 rows) > > chembl_21=# select * from rdk.mols where > m@>mol_adjust_query_properties('*c1nc(*)cc(*)n1') limit 3; > molregno |m > > --+-- >726443 | CN=C(S)NNc1nc(C)cc(C)n1 >561136 | > C[C@H](Nc1cc(NC2CC2)nc(C(F)(F)F)n1)[C@@H](Cc1ccc(Cl)cc1)c1(Br)c1 >205784 | CCN(CC)C(=O)CSc1nc(N)cc(Cl)n1 > (3 rows) > > There's more detail in the blog post, but the default behavior is to convert > dummies into generic query atoms and to constrain the substitution at any > other *ring* position. > > Best Regards, > -greg > > > On Wed, Nov 23, 2016 at 9:20 AM, Alexander Klenner-Bajaja <aklen...@epo.org> > wrote: > Hi all, > > I am currently exploring the possibilities of the RDKit database cartridge > for substructure search- I installed everything following the tutorial from > http://www.rdkit.org/docs/Install.html > > Very nice tutorial - worked perfectly fine. > > Since we are exploring solutions for browser based gui searches I created a > test page using Ketcher (http://lifescience.opensource.epam.com/ketcher/) > which communicates with the database through PHP. > > Ketcher returns a SMILES representation from the drawn molecule. The raw data > of the molecules in the database are canonical SMILES created from RDKIT > canonical SMILES from the rdkit KNIME node (they are text-mined from patents). > > When doing substructure searches, as long as we query for well-defined > compounds the results make sense – however looking at R1,…-groups things get > a little odd. > > I found a very old discussion on the mailing list from 2009 where this has > been discussed and I understood from that dialog that when looking at SMILES > with a “*” representation this is interpreted as a dummy atom and the same > dummy atom is expected in the search space to produce a hit. While a SMARTS > representation of the same string actually leads to the behaviour that “any > atom” is matched at that position. > > I ended up with the very cumbersome query, I am sure there are more elegant > ways of doing this using ::qmol notation, but as I said I am currently >
Re: [Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens
Alex, I'm glad that looks right. Unfortunately those changes are in the 2016.09 version of the RDKit, which was just finalized today. We haven't completed the anaconda builds for that yet. -greg On Wed, Nov 23, 2016 at 3:29 PM, Alexander Klenner-Bajaja <aklen...@epo.org> wrote: > Dear Greg, > > > > Thank you very much, looking at the results that function was exactly what > I was looking for – only I can’t find it in my updated anaconda > installation. > > > > “conda update rdkit” tells me I have the latest version 2016.03.4 and > postgres tells me I have the 3.4 version of the RDKit extension > > > > If I understand your blog post correctly it should be in 2016.03 version? > What am I missing? > > > > > > Best, > > > > Alex > > > > > > > > *From:* Greg Landrum [mailto:greg.land...@gmail.com] > *Sent:* Wednesday, November 23, 2016 11:42 AM > *To:* Alexander Klenner-Bajaja > *Cc:* rdkit-discuss@lists.sourceforge.net > *Subject:* Re: [Rdkit-discuss] smarts vs smiles database queries and > explicit hydrogens > > > > Hi Alex, > > > > The new version of the cartridge has some capabilities that, I think, > address this. > > > > There's a blog post about this: http://rdkit.blogspot.com/2016 > /07/tuning-substructure-queries-ii.html > > but the short version is that you can do the kind of queries it seems like > you want to do quite simply: > > > > chembl_21=# select * from rdk.mols where > m@>mol_adjust_query_properties('*c1ncccn1') > limit 3; > > molregno | m > > > --+- > -- > >601707 | CCCc1nc(-c2ccc(F)cc2)oc1C(=O)NC(CC)CN1CCN(c2ncccn2)CC1 > >289103 | CC1C(=N)/C(=N/Nc2ccc(S(=O)(=O)Nc3ncccn3)cc2)C(=O)C(C)C1=O > >607646 | CCNC(=O)[C@@H]1OC(n2cnc3c(NC(=O)Nc4ccc(S(=O)(=O)Nc5ncccn5)cc > 4)ncnc32)[C@@H](O)[C@H]1O > > (3 rows) > > > > chembl_21=# select * from rdk.mols where > m@>mol_adjust_query_properties('*c1nc(*)ccn1') > limit 3; > > molregno | m > > --+--- > >158659 | CCNc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(C)CC2)n1 > >158743 | Nc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(Cc3c3)CC2)n1 > >158843 | CC1(C)CC(n2cnc(-c3ccc(F)cc3)c2-c2ccnc(N)n2)CC(C)(C)N1 > > (3 rows) > > > > chembl_21=# select * from rdk.mols where > m@>mol_adjust_query_properties('*c1nc(*)cc(*)n1') > limit 3; > > molregno |m > > > --+- > - > >726443 | CN=C(S)NNc1nc(C)cc(C)n1 > >561136 | C[C@H](Nc1cc(NC2CC2)nc(C(F)(F)F)n1)[C@@H](Cc1ccc(Cl)cc1) > c1(Br)c1 > >205784 | CCN(CC)C(=O)CSc1nc(N)cc(Cl)n1 > > (3 rows) > > > > There's more detail in the blog post, but the default behavior is to > convert dummies into generic query atoms and to constrain the substitution > at any other *ring* position. > > > > Best Regards, > > -greg > > > > > > On Wed, Nov 23, 2016 at 9:20 AM, Alexander Klenner-Bajaja < > aklen...@epo.org> wrote: > > Hi all, > > > > I am currently exploring the possibilities of the RDKit database cartridge > for substructure search- I installed everything following the tutorial > from http://www.rdkit.org/docs/Install.html > > > > Very nice tutorial - worked perfectly fine. > > > > Since we are exploring solutions for browser based gui searches I created > a test page using Ketcher (http://lifescience.opensource.epam.com/ketcher/) > which communicates with the database through PHP. > > > > Ketcher returns a SMILES representation from the drawn molecule. The raw > data of the molecules in the database are canonical SMILES created from > RDKIT canonical SMILES from the rdkit KNIME node (they are text-mined from > patents). > > > > When doing substructure searches, as long as we query for well-defined > compounds the results make sense – however looking at R1,…-groups things > get a little odd. > > > > I found a very old discussion on the mailing list from 2009 where this has > been discussed and I understood from that dialog that when looking at > SMILES with a “*” representation this is interpreted as a dummy atom and > the same dummy atom is expected in the search space to produce a hit. While > a SMARTS representation of the same string actually leads to the behaviour > that “any atom” is matched at that position. > >
Re: [Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens
Hi Alex, The new version of the cartridge has some capabilities that, I think, address this. There's a blog post about this: http://rdkit.blogspot.com/ 2016/07/tuning-substructure-queries-ii.html but the short version is that you can do the kind of queries it seems like you want to do quite simply: chembl_21=# select * from rdk.mols where m@>mol_adjust_query_properties('*c1ncccn1') limit 3; molregno | m --+- -- 601707 | CCCc1nc(-c2ccc(F)cc2)oc1C(=O)NC(CC)CN1CCN(c2ncccn2)CC1 289103 | CC1C(=N)/C(=N/Nc2ccc(S(=O)(=O)Nc3ncccn3)cc2)C(=O)C(C)C1=O 607646 | CCNC(=O)[C@@H]1OC(n2cnc3c(NC(=O)Nc4ccc(S(=O)(=O)Nc5ncccn5) cc4)ncnc32)[C@@H](O)[C@H]1O (3 rows) chembl_21=# select * from rdk.mols where m@>mol_adjust_query_properties('*c1nc(*)ccn1') limit 3; molregno | m --+--- 158659 | CCNc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(C)CC2)n1 158743 | Nc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(Cc3c3)CC2)n1 158843 | CC1(C)CC(n2cnc(-c3ccc(F)cc3)c2-c2ccnc(N)n2)CC(C)(C)N1 (3 rows) chembl_21=# select * from rdk.mols where m@>mol_adjust_query_properties('*c1nc(*)cc(*)n1') limit 3; molregno |m --+- - 726443 | CN=C(S)NNc1nc(C)cc(C)n1 561136 | C[C@H](Nc1cc(NC2CC2)nc(C(F)(F)F)n1)[C@@H](Cc1ccc(Cl) cc1)c1(Br)c1 205784 | CCN(CC)C(=O)CSc1nc(N)cc(Cl)n1 (3 rows) There's more detail in the blog post, but the default behavior is to convert dummies into generic query atoms and to constrain the substitution at any other *ring* position. Best Regards, -greg On Wed, Nov 23, 2016 at 9:20 AM, Alexander Klenner-Bajajawrote: > Hi all, > > > > I am currently exploring the possibilities of the RDKit database cartridge > for substructure search- I installed everything following the tutorial > from http://www.rdkit.org/docs/Install.html > > > > Very nice tutorial - worked perfectly fine. > > > > Since we are exploring solutions for browser based gui searches I created > a test page using Ketcher (http://lifescience.opensource.epam.com/ketcher/) > which communicates with the database through PHP. > > > > Ketcher returns a SMILES representation from the drawn molecule. The raw > data of the molecules in the database are canonical SMILES created from > RDKIT canonical SMILES from the rdkit KNIME node (they are text-mined from > patents). > > > > When doing substructure searches, as long as we query for well-defined > compounds the results make sense – however looking at R1,…-groups things > get a little odd. > > > > I found a very old discussion on the mailing list from 2009 where this has > been discussed and I understood from that dialog that when looking at > SMILES with a “*” representation this is interpreted as a dummy atom and > the same dummy atom is expected in the search space to produce a hit. While > a SMARTS representation of the same string actually leads to the behaviour > that “any atom” is matched at that position. > > > > I ended up with the very cumbersome query, I am sure there are more > elegant ways of doing this using ::qmol notation, but as I said I am > currently exploring J > > > > That’s the query (in PHP) in question for PostgreSQL: > > > > *$search_result = pg_query($dbconn, "select m from pat.mols where > m@>mol_from_smarts(mol_to_smiles(mol_from_smiles('".$_POST['smiles']. "'))) > LIMIT 20;"); * > > > > Extracting rdkit functionality leaves me with: > > > > *m@>mol_from_smarts(mol_to_smiles(mol_from_smiles('".$_POST['smiles']. > "')))* > > and adding a smiles string to make it more readable: > > > > *m@>mol_from_smarts(mol_to_smiles(mol_from_smiles('* C([*])1=CC=CC=C1*'))) > (This is how Ketcher creates the smiles string, using explicit double > bonds)* > > > > This query does actually work and returns structures that are correct > (visually inspected a few examples) > > > > The same query without all the molecule conversion methods does not return > anything > > > > *m@>'* C([*])1=CC=CC=C1*'* > > > > I guess the reason for this is that the default interpretation is smiles > and it is looking for actual dummy atoms in the database (there are none). > > > > That’s my first question: Is this assumption correct? > > > > My next issue is a query with explicit hydrogens: > > > > Using > > > > *“C([*])1=C([H])C([H])=C([H])C([H])=C1[H]” * > > > > as a query with the all the molecule conversion as shown above to make > SMARTS happen, returns among others: > > > > *“C(C)1=CC=C(C)C=C1”* > > > > Which is correct for implicit hydrogens but not for explicit – so my guess > is they are lost. > > > > Can I enforce at query time against the cartridge to work with explicit > hydrogens so that only molecules are found that have different substitutes > at the