A minor correction: [H] by itself *is* valid and means a hydrogen atom. The
Daylight docs say as much in section 4.1. But in other contexts it means a
hydrogen count, so to be safe, always using #1 to mean a hydrogen atom can
be a good practice.
If you are ever in doubt about how RDKit is interpreting a SMARTS,
I recommend making use of the DescribeQuery function which provides a tree
representation of a query atom or bond. For example (comments added):
>>> mol = Chem.MolFromSmarts('[H][N,H][N,#1]')
>>> print(mol.GetAtomWithIdx(0).DescribeQuery()) # [H]
AtomAtomicNum 1 = val # [H] interpreted as a hydrogen atom
>>> print(mol.GetAtomWithIdx(1).DescribeQuery()) # [N,H]
AtomOr
AtomType 7 = val
AtomHCount 1 = val # H interpreted as a hydrogen count
# Overall query atom means "an aliphatic nitrogen OR (any atom with one
hydrogen)!
>>> print(mol.GetAtomWithIdx(2).DescribeQuery()) # [N,#1]
AtomOr
AtomType 7 = val
AtomAtomicNum 1 = val # "#1" is atomic number, therefore a hydrogen atom.
# Overall query atom means "an aliphatic nitrogen OR a hydrogen"
One non-obvious convention in the DescribeQuery output is that AtomType
implies aliphatic when the value is a normal atomic number, or aromatic if
the atomic number is offset by 1000. For example, [n] is "AtomType 1007".
Hope you find this approach useful in the future.
Ivan
On Tue, Mar 1, 2022 at 6:33 AM David Cosgrove
wrote:
> Hi Adam
> There are a number of issues here. The key one, I think, is a
> misunderstanding about the meaning of H in SMARTS. It means "a single
> attached hydrogen", and is a qualifier for another atom, it cannot be used
> by itself. So [*H] is valid, [H] isn't. See the table at
> https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html. If you
> want to refer to an explicit hydrogen, you have to use [#1]. However, that
> will only match an explicit hydrogen in the molecule, not an implicit one.
> Thus c[#1] doesn't match anything in c1c1. If you have read in a
> molecule from a molfile, for example, that has explicit hydrogens then you
> will be ok.
>
> Further to that, your SMARTS strings, at least as they have appeared in
> gmail, which may have garbled them, are incorrect. In S1, the brackets
> round [N,n,H] make it a substituent, so it will not match the indole
> nitrogen. Also, it would probably be better as [N,n;H], which would be
> read as "(aliphatic nitrogen OR aromatic nitrogen) AND 1 attached
> hydrogen." The [N,n,H] will match a methylated indole nitrogen which I
> imagine is not what you want. Similar remarks apply to S2. A SMARTS that
> matches both 6CI and PCT
> is [C,c]1(Cl)[C,c][C,c;H][C,c]([C,c])[C,c;H][C,c]1, but that won't match
> the H atoms themselves if you want to use them in the overlay, and it also
> won't work in the aliphatic case of, for example, ClC1CCC(C)CC1 because
> there the carbon atoms have 2 attached hydrogens. If you really do want
> it to match aliphatic cases as well, then you will need something
> like
> [C,c]1(Cl)[$([CH2]),$([cH])][$([CH2]),$([cH])][C,c]([C,c])[$([CH2]),$([cH])][$([CH2]),$([cH])]1
> which is quite a mouthful. The carbons at the 2,3,5 and 6 positions on the
> ring are specified as either [CH2] or [cH].
>
> Jupyter notebook can be really useful for debugging SMARTS patterns like
> this. The one I used was variations of
> ```
> from rdkit import Chem
> from IPython.display import SVG
> mol = Chem.MolFromSmiles('C1=CC(=CC2=C1C=CN2C)Cl')
> qmol = Chem.MolFromSmarts('[C,c]1(Cl)[C,c][C,c][C,c]([C,c])[C,c][C,c]1')
> print(mol.GetSubstructMatches(qmol))
> mol
> ```
> which prints the numbers of the matching atoms and also draws the molecule
> with the match highlighted.
> Regards,
> Dave
>
>
> On Tue, Mar 1, 2022 at 1:43 AM Adam Moyer wrote:
>
>> Hello,
>>
>> I have a baffling case where I am trying to match substructures on two
>> ligands for the goal of aligning them.
>>
>> I have two ligands; one is a 6-chloroindole (6CI) and the other is a
>> para-chloro toluene (PCT).
>>
>> I am attempting to use the following SMARTS (S1) to match
>> them: '[C,c]1(Cl)[C,c][C,c]*([N,n,H])*[C,c]([C,c,H])[C,c]([H])[C,c]1'.
>> For some reason S1 only finds a match in 6CI.
>>
>> When I use the following SMARTS (S2) I only match to PCT as expected:
>> '[C,c]1(Cl)[C,c][C,c]*([H])*[C,c]([C,c,H])[C,c]([H])[C,c]1'.
>>
>> How can S1 not match PCT? S1 is strictly a superset of S2 because I am
>> using the "or" operation. Do I have a misunderstanding of how explicit
>> hydrogens work in RDKit/SMARTS?
>>
>> Lastly when I use the last SMARTS (S3) I am able to match to both, but I
>> cannot use that smarts due to other requirements in my
>> project: '[C,c]1(Cl)[C,c][C,c][C,c]([C,c,H])[C,c]([H])[C,c]1'
>>
>> Thanks!
>> Adam
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
>
> --
> David Cosgrove
> Freelance computational