Re: [Rdkit-discuss] invalid CTAB substructure query with PostgreSQL cartridge

2021-12-10 Thread Greg Landrum
Hi Susan,

I haven't looked at the Sgroup or unsaturation flags yet (I will try to do
this later today), but a word on the aromaticity/kekulization.

One of the things which is going wrong here at the cartridge level is that
qmol_from_ctab() is sanitizing the molecules it reads in. This is not
correct. qmol_from_ctab is intended to produce a query and queries should
not be sanitized. I will get that fixed and, assuming it doesn't turn up
other problems, we should be able to get that into the next patch release.
Here's that bug report:
https://github.com/rdkit/rdkit/issues/4787

There's another potential issue with how bonds from CTABs are parsed,
inspired by your first message, which we're discussing here:
https://github.com/rdkit/rdkit/issues/4785
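
For reference, the pattern from that first message (a bond of type 4, i.e. aromatic, on an atom that is not in any ring) can be spotted without RDKit. What follows is only a minimal sketch and not how RDKit itself parses CTABs. The helper names are made up for illustration, and the z coordinates and counts-line padding in the example CTAB are assumed, because the archived copy of the message lost them.

```python
from collections import defaultdict


def parse_v2000_bonds(ctab):
    """Return [(begin, end, bond_type), ...] using 1-based CTAB atom numbers."""
    lines = ctab.split("\n")
    # The counts line is the one carrying the V2000 version stamp.
    start = next(i for i, ln in enumerate(lines) if ln.rstrip().endswith("V2000"))
    n_atoms, n_bonds = int(lines[start][0:3]), int(lines[start][3:6])
    first_bond = start + 1 + n_atoms
    # Bond lines are fixed width: three characters per field.
    return [
        (int(ln[0:3]), int(ln[3:6]), int(ln[6:9]))
        for ln in lines[first_bond:first_bond + n_bonds]
    ]


def in_ring(index, bonds):
    """A bond is in a ring if its two atoms stay connected without it."""
    begin, end, _ = bonds[index]
    neighbours = defaultdict(set)
    for i, (a, b, _) in enumerate(bonds):
        if i != index:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, stack = {begin}, [begin]
    while stack:
        atom = stack.pop()
        if atom == end:
            return True
        for nxt in neighbours[atom] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return False


def acyclic_aromatic_bonds(ctab):
    """List (begin, end) for every type-4 bond that is not part of a ring."""
    bonds = parse_v2000_bonds(ctab)
    return [
        (a, b)
        for i, (a, b, bond_type) in enumerate(bonds)
        if bond_type == 4 and not in_ring(i, bonds)
    ]


# Boronate query from the first message (z = 0.0000 and padding are assumed).
BORONATE_CTAB = "\n".join([
    "Boronate acid/ester(aryl)",
    "  SciTegic12012112112D",
    "",
    "  5  4  0  0  0  0            999 V2000",
    "    1.7243   -2.7324    0.0000 A   0  0",
    "    2.7559   -2.1456    0.0000 C   0  0",
    "    3.7808   -2.7324    0.0000 B   0  0",
    "    4.8057   -2.1456    0.0000 O   0  0",
    "    3.7808   -3.9190    0.0000 O   0  0",
    "  1  2  4  0  0  1  0",
    "  2  3  1  0",
    "  3  4  1  0",
    "  3  5  1  0",
    "M  END",
])

print(acyclic_aromatic_bonds(BORONATE_CTAB))  # [(1, 2)]
```

Note that CTAB atom numbers are 1-based while RDKit atom indices are 0-based, so bond (1, 2) here corresponds to the "non-ring atom 0" in the error message.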

Thanks for the very detailed descriptions of the problem!
-greg



On Fri, Dec 10, 2021 at 11:11 AM Susan Leung wrote:

> Hi Paolo,
>
>
> Thanks very much for filing the bug and for offering the Python
> preprocessing solution.
>
>
> I actually have a few more CTABs that are not valid. One of them raises a
> kekulisation error and, like the previous non-ring aromatic atom problem,
> there is an alternative SMARTS query that can be written. I suspect that
> you might suggest some Python preprocessing for this and converting to
> SMARTS?
>
>
> I don’t think the other two errors are to do with sanitization; they are
> possibly due to the way the CTAB is read. One has multiple atoms with the
> unsaturated flag turned on. The other has an SGROUP defined.
>
> Please see below, where I have tried to summarize, and I again attach an
> ipynb in case it helps.
>
>
> Thanks very much,
>
>
> Susan
>
>
> Example 2: Kekulization
>
> I want a query CTAB to match the following two tautomers.
>
> sm1 = 'Cc1c[nH]nc1C'
>
> sm2 = 'Cc1cn[nH]c1C'
>
>
>
> I would like to use the following CTAB as the query but it is not valid.
> I’m guessing it’s because of kekulization, which is the error produced when
> doing MolFromMolBlock:
>
> ctab = """
>   ACCLDraw12082111532D
>
>   7  7  0  0  0  0  0  0  0  0999 V2000
>     9.2840  -12.1344    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>    10.2367  -11.4422    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>     9.8729  -10.3217    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
>     8.6950  -10.3217    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
>     8.3309  -11.4422    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>     7.1932  -11.7471    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>     9.2840  -13.3122    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>   2  1  4  0  0  0  0
>   2  3  4  0  0  0  0
>   3  4  4  0  0  0  0
>   4  5  4  0  0  0  0
>   5  1  4  0  0  0  0
>   5  6  1  0  0  0  0
>   7  1  1  0  0  0  0
> M  END
> """
>
> select is_valid_ctab('{ctab}')
>
>
>
> Returns False
>
> I can make an alternative CTAB, with a hydrogen on one of the nitrogens,
> that is valid, but then it doesn’t match both sm1 and sm2.
>
> ctab_fixed = """
>   ACCLDraw12082111272D
>
>   8  8  0  0  0  0  0  0  0  0999 V2000
>    11.6590  -11.9469    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>    12.6117  -11.2547    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>    12.2479  -10.1342    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0
>    11.0700  -10.1342    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
>    10.7059  -11.2547    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
>     9.5682  -11.5596    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>    11.6590  -13.1247    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>    12.8368   -9.1142    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
>   2  1  4  0  0  0  0
>   2  3  4  0  0  0  0
>   3  4  4  0  0  0  0
>   4  5  4  0  0  0  0
>   5  1  4  0  0  0  0
>   5  6  1  0  0  0  0
>   7  1  1  0  0  0  0
>   3  8  1  0  0  0  0
> M  END
> """
>
>
>
> select mol_from_smiles('{sm1}') @> qmol_from_ctab('{ctab_fixed}')
>
>
>
> Returns True
>
> select mol_from_smiles('{sm2}') @> qmol_from_ctab('{ctab_fixed}')
>
>
>
> Returns False
>
>
>
> However, I can make a qmol from SMARTS that can match with both:
>
> alt_smarts = '[#6]1(:[#6]:[#7]:[#7]:[#6]:1-[#6])-[#6]'
>
>
>
> select mol_from_smiles('{sm1}') @> qmol_from_smarts('{alt_smarts}')
>
>
>
> select mol_from_smiles('{sm2}') @> qmol_from_smarts('{alt_smarts}')
>
>
>
> Both return True.
>
> ___
>
> Example 3: Unsaturated
>
> How does RDKit handle the M  UNS line? It can’t seem to handle this CTAB
> for example, where multiple atoms (atom 8 and atom 9) have the unsaturated
> flag on.
>
> ctab_og = """
>   ACCLDraw12082113482D
>
>   9  9  0  0  0  0  0  0  0  0999 V2000
>     2.6030  -22.3675    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>     3.6258  -21.7773    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>     4.6447  -22.3669    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>     4.6447  -23.5480    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>
> 

Re: [Rdkit-discuss] invalid CTAB substructure query with PostgreSQL cartridge

2021-12-09 Thread Paolo Tosco
Hi Susan,

that looks like a bug in the way the MDL query is parsed; I have filed it
here:
https://github.com/rdkit/rdkit/issues/4785

If you can afford to do some Python massaging of your CTAB queries,
converting them to SMARTS before submitting them to PostgreSQL when they
fail sanitization, the following should work:

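import re

from rdkit import Chem
from rdkit.Chem import rdqueries

# conn (an open psycopg2 connection), ctab_og and sm1 are defined elsewhere.
mol_og = Chem.MolFromMolBlock(ctab_og, sanitize=False)
try:
    # If the CTAB sanitizes cleanly, submit it to the cartridge as-is.
    Chem.SanitizeMol(mol_og)
    cur = conn.cursor()
    cur.execute(f"""select mol_from_smiles('{sm1}') @> qmol_from_ctab('{ctab_og}')""")
    rows = cur.fetchall()
    print(rows)
except Chem.AtomKekulizeException as e:
    if re.match(r"non-ring atom \d+ marked aromatic", str(e)):
        # Ring info is needed for IsInRing() on the unsanitized molecule.
        Chem.FastFindRings(mol_og)
        rwmol_og = Chem.RWMol(mol_og)
        # Swap each non-ring aromatic atom for an "is aromatic" query atom.
        for a in mol_og.GetAtoms():
            if a.GetIsAromatic() and not a.IsInRing():
                rwmol_og.ReplaceAtom(a.GetIdx(),
                                     rdqueries.IsAromaticQueryAtom())
        try:
            # Retry sanitization, then send the query as SMARTS instead.
            Chem.SanitizeMol(rwmol_og)
            smarts_og = Chem.MolToSmarts(rwmol_og)
            cur = conn.cursor()
            cur.execute(f"""select mol_from_smiles('{sm1}') @> mol_from_smarts('{smarts_og}')""")
            rows = cur.fetchall()
            print(rows)
        except Exception:
            ...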
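
A side note on the f-string queries above: a CTAB title or data field containa single quote would break the SQL string. One possible alternative, sketched here under assumptions, is to let the driver bind the values. The helper name build_substructure_query is hypothetical, the sketch relies on psycopg2's `%s` placeholder style, and it has not been checked whether a given cartridge version needs an explicit cast on the bound parameters.

```python
def build_substructure_query(smiles, ctab):
    """Return (sql, params) for a CTAB substructure test via the cartridge."""
    # The %s placeholders are filled in by the driver, which handles quoting.
    sql = "select mol_from_smiles(%s) @> qmol_from_ctab(%s)"
    return sql, (smiles, ctab)


# The "..." stands in for a real CTAB; note the apostrophe in the title line.
sql, params = build_substructure_query("Cc1c[nH]nc1C", "query's title\n...")
print(sql)

# With an open psycopg2 connection (assumed to exist as `conn`):
#     cur = conn.cursor()
#     cur.execute(sql, params)
#     print(cur.fetchall())
```

The SQL text stays constant, so the CTAB never has to be escaped by hand.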

HTH, cheers
p.


On Thu, Dec 9, 2021 at 5:32 PM Susan Leung wrote:

> Hi all,
>
>
>
> I am trying to do some substructure queries using the RDKit PostgreSQL
> cartridge. Specifically, my substructure query inputs are CTABs (not
> SMARTS), so I would like to use qmol_from_ctab. However, I have some
> problems making valid query molecules from a few CTABs.
>
>
>
> In this example, I try to use a CTAB to make a query that searches for an
> aryl boronate acid/ester. I can make an equivalent query using SMARTS, but
> the CTAB is not valid.
>
>
> As far as I am aware, there's no warning message when using the SQL
> functions, so I use MolFromMolBlock from Python and get "non-ring atom 0
> marked aromatic" so I correct the aromatic bond type to double bond and
> the CTAB can be read in (but that's not the query I want). I am guessing
> that there may be additional validity checks / sanitization steps when
> executing qmol_from_ctab vs qmol_from_smarts? As far as I can see, there’s
> no flag in qmol_from_ctab.
>
>
>
> I describe the general problem below but also attach the ipynb (if it is
> useful) that uses psycopg2 to do the SQL, leaving out the database
> connection credentials.
>
>
>
> Many thanks,
>
>
>
> Susan
>
> __
>
>
>
> For example, I want to match an aromatic boronic acid:
>
> sm1 = 'OB(O)c1c1'
>
>
>
> But the following CTAB isn’t valid. MolFromMolBlock raises a "non-ring
> atom marked aromatic" error, so I suspect it’s to do with that. Also,
> changing the bond marked aromatic ‘4’ to a double bond ‘2’ makes the CTAB
> valid.
>
> ctab_og = """Boronate acid/ester(aryl)
>   SciTegic12012112112D
>
>   5  4  0  0  0  0            999 V2000
>     1.7243   -2.7324    0.0000 A   0  0
>     2.7559   -2.1456    0.0000 C   0  0
>     3.7808   -2.7324    0.0000 B   0  0
>     4.8057   -2.1456    0.0000 O   0  0
>     3.7808   -3.9190    0.0000 O   0  0
>   1  2  4  0  0  1  0
>   2  3  1  0
>   3  4  1  0
>   3  5  1  0
> M  END
> > 
> Boronate acid/ester(aryl)
>
> """
>
> ctab_fixed = """Boronate acid/ester(aryl)
>   SciTegic12012112112D
>
>   5  4  0  0  0  0            999 V2000
>     1.7243   -2.7324    0.0000 A   0  0
>     2.7559   -2.1456    0.0000 C   0  0
>     3.7808   -2.7324    0.0000 B   0  0
>     4.8057   -2.1456    0.0000 O   0  0
>     3.7808   -3.9190    0.0000 O   0  0
>   1  2  2  0  0  1  0
>   2  3  1  0
>   3  4  1  0
>   3  5  1  0
> M  END
> > 
> Boronate acid/ester(aryl)
>
> """
>
> select is_valid_ctab('{ctab_og}')
>
>
>
> Returns False
>
> select is_valid_ctab('{ctab_fixed}')
>
> Returns True
>
>
>
> However, I can make a qmol from SMARTS that matches sm1. Is there a way of
> making the query CTAB valid so we don’t have to use SMARTS?
>
> select mol_from_smiles('{sm1}') @> qmol_from_ctab('{ctab_fixed}')
>
>
>
> Returns False
>
>
>
> select mol_from_smiles('{sm1}') @> qmol_from_smarts('{alt_smarts}')
>
>
>
> Returns True
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>