On Oct 26, 2020, at 17:41, Cyrus Maher <cma...@vir.bio> wrote:
> I’m wondering if there is an easy way to retrieve the atom numbers that the 
> morgan fingerprints algorithm assigns as its first step.

Many of the fingerprint function support an optional "bitInfo" parameter. If 
it's a dictionary then the keys are the bit that was set, and the value is at 
tuple of the (atom index, radius) which set it.

Here's an example with theobromine using r=0, which lets you see the initial 
invariants:

>>> from rdkit import Chem
>>> from rdkit.Chem import rdMolDescriptors
>>> mol = Chem.MolFromSmiles("Cn1cnc2c1c(=O)[nH]c(=O)n2C")
>>> bitInfo = {}
>>> fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius=0, 
>>> useFeatures=1, bitInfo=bitInfo)
>>> for bitno, pairs in sorted(bitInfo.items()):
...   print(f"Bitno: {bitno}")
...   for atom_idx, r in pairs:
...     print(f"  atom {atom_idx} ({mol.GetAtomWithIdx(atom_idx).GetSymbol()}) 
with radius {r}")
...
Bitno: 0
  atom 0 (C) with radius 0
  atom 12 (C) with radius 0
Bitno: 2
  atom 7 (O) with radius 0
  atom 10 (O) with radius 0
Bitno: 4
  atom 2 (C) with radius 0
  atom 4 (C) with radius 0
  atom 5 (C) with radius 0
  atom 6 (C) with radius 0
  atom 9 (C) with radius 0
Bitno: 5
  atom 8 (N) with radius 0
Bitno: 6
  atom 1 (N) with radius 0
  atom 3 (N) with radius 0
  atom 11 (N) with radius 0

If I follow the code correctly, when useFeatures == 1 then the intial 
invariants are set by getFeatureInvariants() in 
./Code/GraphMol/Fingerprints/FingerprintUtil.cpp , available at:

https://github.com/rdkit/rdkit/blob/d594998dda2803a15aa0066e06f2477b71ed3be6/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L221

A few lines up, at 
https://github.com/rdkit/rdkit/blob/d594998dda2803a15aa0066e06f2477b71ed3be6/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L182
 , you'll see the features patterns defined in smartsPatterns

They appear to be identical to the list you gave.

I reimplemented the initialization function (copied at the end of this email). 
Running the program shows that it produces the same invariants which are used 
as the bit numbers in the Morgan feature fingerprint:

Invariant: 0
  atom 0 (C)
  atom 12 (C)
Invariant: 2
  atom 7 (O)
  atom 10 (O)
Invariant: 4
  atom 2 (C)
  atom 4 (C)
  atom 5 (C)
  atom 6 (C)
  atom 9 (C)
Invariant: 5
  atom 8 (N)
Invariant: 6
  atom 1 (N)
  atom 3 (N)
  atom 11 (N)


I believe that gives you two ways to get the information you want!

Best regards,

                                Andrew
                                da...@dalkescientific.com




# Python re-implementation of RDKit's getFeatureInvariants() from
# ./Code/GraphMol/Fingerprints/FingerprintUtil.cpp

from rdkit import Chem

smartsPatterns = [
    "[$([N;!H0;v3,v4&+1]),\
$([O,S;H1;+0]),\
n&H1&+0]",                                                  # Donor
    "[$([O,S;H1;v2;!$(*-*=[O,N,P,S])]),\
$([O,S;H0;v2]),\
$([O,S;-]),\
$([N;v3;!$(N-*=[O,N,P,S])]),\
n&H0&+0,\
$([o,s;+0;!$([o,s]:n);!$([o,s]:c:n)])]",                    # Acceptor
    "[a]",                                                  # Aromatic
    "[F,Cl,Br,I]",                                          # Halogen
    "[#7;+,\
$([N;H2&+0][$([C,a]);!$([C,a](=O))]),\
$([N;H1&+0]([$([C,a]);!$([C,a](=O))])[$([C,a]);!$([C,a](=O))]),\
$([N;H0&+0]([C;!$(C(=O))])([C;!$(C(=O))])[C;!$(C(=O))])]",  # Basic
    "[$([C,S](=[O,S,P])-[O;H1,-1])]"                        # Acidic
    ]

mol = Chem.MolFromSmiles("Cn1cnc2c1c(=O)[nH]c(=O)n2C")
invariants = [0] * mol.GetNumAtoms()
for pattern_idx, smartsPattern in enumerate(smartsPatterns):
    pat = Chem.MolFromSmarts(smartsPattern)
    for (atom_idx,) in mol.GetSubstructMatches(pat):
        invariants[atom_idx] |= (1<<pattern_idx)

# Show all atoms with the same invariant, ordered by invariant
from collections import defaultdict
by_invariant = defaultdict(list)
for atom_idx, invariant in enumerate(invariants):
    by_invariant[invariant].append(atom_idx)
for invariant, atom_indices in sorted(by_invariant.items()):
    print(f"Invariant: {invariant}")
    for atom_idx in atom_indices:
        print(f"  atom {atom_idx} ({mol.GetAtomWithIdx(atom_idx).GetSymbol()})")






_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to