Re: [Rdkit-discuss] Question matching substructures from SMARTS with explicit hydrogens

2022-03-01 Thread Ivan Tubert-Brohman
A minor correction: [H] by itself *is* valid and means a hydrogen atom. The
Daylight docs say as much in section 4.1. But in other contexts it means a
hydrogen count, so to be safe, always using #1 to mean a hydrogen atom can
be a good practice.

If you are ever in doubt about how RDKit is interpreting a SMARTS,
I recommend making use of the DescribeQuery function which provides a tree
representation of a query atom or bond. For example (comments added):

>>> mol = Chem.MolFromSmarts('[H][N,H][N,#1]')


>>> print(mol.GetAtomWithIdx(0).DescribeQuery())  # [H]

AtomAtomicNum 1 = val  # [H] interpreted as a hydrogen atom

>>> print(mol.GetAtomWithIdx(1).DescribeQuery())  # [N,H]

AtomOr
  AtomType 7 = val
  AtomHCount 1 = val  # H interpreted as a hydrogen count
# Overall query atom means "an aliphatic nitrogen OR (any atom with one
hydrogen)!

>>> print(mol.GetAtomWithIdx(2).DescribeQuery())   # [N,#1]

AtomOr
  AtomType 7 = val
  AtomAtomicNum 1 = val  # "#1" is atomic number, therefore a hydrogen atom.
# Overall query atom means "an aliphatic nitrogen OR a hydrogen"

One non-obvious convention in the DescribeQuery output is that AtomType
implies aliphatic when the value is a normal atomic number, or aromatic if
the atomic number is offset by 1000. For example, [n] is "AtomType 1007".

Hope you find this approach useful in the future.

Ivan

On Tue, Mar 1, 2022 at 6:33 AM David Cosgrove 
wrote:

> Hi Adam
> There are a number of issues here.  The key one, I think, is a
> misunderstanding about the meaning of H in SMARTS.  It means "a single
> attached hydrogen", and is a qualifier for another atom, it cannot be used
> by itself.  So [*H] is valid, [H] isn't.  See the table at
> https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html.  If you
> want to refer to an explicit hydrogen, you have to use [#1].  However, that
> will only match an explicit hydrogen in the molecule, not an implicit one.
> Thus c[#1] doesn't match anything in c1c1.  If you have read in a
> molecule from a molfile, for example, that has explicit hydrogens then you
> will be ok.
>
> Further to that, your SMARTS strings, at least as they have appeared in
> gmail, which may have garbled them, are incorrect.  In S1, the brackets
> round [N,n,H] make it a substituent, so it will not match the indole
> nitrogen.  Also, it would probably be better as [N,n;H], which would be
> read as "(aliphatic nitrogen OR aromatic nitrogen) AND 1 attached
> hydrogen."  The [N,n,H] will match a methylated indole nitrogen which I
> imagine is not what you want. Similar remarks apply to S2.  A SMARTS that
> matches both 6CI and PCT
> is [C,c]1(Cl)[C,c][C,c;H][C,c]([C,c])[C,c;H][C,c]1, but that won't match
> the H atoms themselves if you want to use them in the overlay, and it also
> won't work in the aliphatic case of, for example, ClC1CCC(C)CC1 because
> there the carbon atoms have 2 attached hydrogens.   If you really do want
> it to match aliphatic cases as well, then you will need something
> like 
> [C,c]1(Cl)[$([CH2]),$([cH])][$([CH2]),$([cH])][C,c]([C,c])[$([CH2]),$([cH])][$([CH2]),$([cH])]1
> which is quite a mouthful.  The carbons at the 2,3,5 and 6 positions on the
> ring are specified as either [CH2] or [cH].
>
> Jupyter notebook can be really useful for debugging SMARTS patterns like
> this.  The one I used was variations of
> ```
> from rdkit import Chem
> from IPython.display import SVG
> mol = Chem.MolFromSmiles('C1=CC(=CC2=C1C=CN2C)Cl')
> qmol = Chem.MolFromSmarts('[C,c]1(Cl)[C,c][C,c][C,c]([C,c])[C,c][C,c]1')
> print(mol.GetSubstructMatches(qmol))
> mol
> ```
> which prints the numbers of the matching atoms and also draws the molecule
> with the match highlighted.
> Regards,
> Dave
>
>
> On Tue, Mar 1, 2022 at 1:43 AM Adam Moyer  wrote:
>
>> Hello,
>>
>> I have a baffling case where I am trying to match substructures on two
>> ligands for the goal of aligning them.
>>
>> I have two ligands; one is a 6-chloroindole (6CI) and the other is a
>> para-chloro toluene (PCT).
>>
>> I am attempting to use the following SMARTS (S1) to match
>> them: '[C,c]1(Cl)[C,c][C,c]*([N,n,H])*[C,c]([C,c,H])[C,c]([H])[C,c]1'.
>> For some reason S1 only finds a match in 6CI.
>>
>> When I use the following SMARTS (S2) I only match to PCT as expected:
>> '[C,c]1(Cl)[C,c][C,c]*([H])*[C,c]([C,c,H])[C,c]([H])[C,c]1'.
>>
>> How can S1 not match PCT? S1 is strictly a superset of S2 because I am
>> using the "or" operation. Do I have a misunderstanding of how explicit
>> hydrogens work in RDKit/SMARTS?
>>
>> Lastly when I use the last SMARTS (S3) I am able to match to both, but I
>> cannot use that smarts due to other requirements in my
>> project: '[C,c]1(Cl)[C,c][C,c][C,c]([C,c,H])[C,c]([H])[C,c]1'
>>
>> Thanks!
>> Adam
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
>
> --
> David Cosgrove
> Freelance computational 

Re: [Rdkit-discuss] Question matching substructures from SMARTS with explicit hydrogens

2022-03-01 Thread David Cosgrove
Hi Adam
There are a number of issues here.  The key one, I think, is a
misunderstanding about the meaning of H in SMARTS.  It means "a single
attached hydrogen", and is a qualifier for another atom, it cannot be used
by itself.  So [*H] is valid, [H] isn't.  See the table at
https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html.  If you
want to refer to an explicit hydrogen, you have to use [#1].  However, that
will only match an explicit hydrogen in the molecule, not an implicit one.
Thus c[#1] doesn't match anything in c1c1.  If you have read in a
molecule from a molfile, for example, that has explicit hydrogens then you
will be ok.

Further to that, your SMARTS strings, at least as they have appeared in
gmail, which may have garbled them, are incorrect.  In S1, the brackets
round [N,n,H] make it a substituent, so it will not match the indole
nitrogen.  Also, it would probably be better as [N,n;H], which would be
read as "(aliphatic nitrogen OR aromatic nitrogen) AND 1 attached
hydrogen."  The [N,n,H] will match a methylated indole nitrogen which I
imagine is not what you want. Similar remarks apply to S2.  A SMARTS that
matches both 6CI and PCT
is [C,c]1(Cl)[C,c][C,c;H][C,c]([C,c])[C,c;H][C,c]1, but that won't match
the H atoms themselves if you want to use them in the overlay, and it also
won't work in the aliphatic case of, for example, ClC1CCC(C)CC1 because
there the carbon atoms have 2 attached hydrogens.   If you really do want
it to match aliphatic cases as well, then you will need something
like 
[C,c]1(Cl)[$([CH2]),$([cH])][$([CH2]),$([cH])][C,c]([C,c])[$([CH2]),$([cH])][$([CH2]),$([cH])]1
which is quite a mouthful.  The carbons at the 2,3,5 and 6 positions on the
ring are specified as either [CH2] or [cH].

Jupyter notebook can be really useful for debugging SMARTS patterns like
this.  The one I used was variations of
```
from rdkit import Chem
from IPython.display import SVG
mol = Chem.MolFromSmiles('C1=CC(=CC2=C1C=CN2C)Cl')
qmol = Chem.MolFromSmarts('[C,c]1(Cl)[C,c][C,c][C,c]([C,c])[C,c][C,c]1')
print(mol.GetSubstructMatches(qmol))
mol
```
which prints the numbers of the matching atoms and also draws the molecule
with the match highlighted.
Regards,
Dave


On Tue, Mar 1, 2022 at 1:43 AM Adam Moyer  wrote:

> Hello,
>
> I have a baffling case where I am trying to match substructures on two
> ligands for the goal of aligning them.
>
> I have two ligands; one is a 6-chloroindole (6CI) and the other is a
> para-chloro toluene (PCT).
>
> I am attempting to use the following SMARTS (S1) to match
> them: '[C,c]1(Cl)[C,c][C,c]*([N,n,H])*[C,c]([C,c,H])[C,c]([H])[C,c]1'.
> For some reason S1 only finds a match in 6CI.
>
> When I use the following SMARTS (S2) I only match to PCT as expected:
> '[C,c]1(Cl)[C,c][C,c]*([H])*[C,c]([C,c,H])[C,c]([H])[C,c]1'.
>
> How can S1 not match PCT? S1 is strictly a superset of S2 because I am
> using the "or" operation. Do I have a misunderstanding of how explicit
> hydrogens work in RDKit/SMARTS?
>
> Lastly when I use the last SMARTS (S3) I am able to match to both, but I
> cannot use that smarts due to other requirements in my
> project: '[C,c]1(Cl)[C,c][C,c][C,c]([C,c,H])[C,c]([H])[C,c]1'
>
> Thanks!
> Adam
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>


-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss