Hi Paolo,

Many thanks for the detailed explanation! Going by your statement "If the invariants are provided by the user, they will be used instead", I attempted to reproduce the default ECFP fingerprint for a small and a large molecule. Here is the code:

import numpy as np
from rdkit import DataStructs
from rdkit.Chem import PeriodicTable, GetPeriodicTable, AllChem
from rdkit import Chem

def getNumpyArray(fp):
    # ConvertToNumpyArray resizes arr to the length of the fingerprint
    arr = np.zeros((1,), np.float32)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

def generateECFPAtomInvariant(mol, discrete_charges=False):
    # Build one 32-bit invariant per atom from the ECFP-style atom descriptors
    num_atoms = mol.GetNumAtoms()
    invariants = [0]*num_atoms
    ring_info = mol.GetRingInfo()
    for i, a in enumerate(mol.GetAtoms()):
        descriptors = []
        descriptors.append(a.GetAtomicNum())
        descriptors.append(a.GetTotalDegree())
        descriptors.append(a.GetTotalNumHs())
        descriptors.append(a.GetFormalCharge())
        # difference between the atom's mass and the standard atomic weight
        descriptors.append(a.GetMass() - PeriodicTable.GetAtomicWeight(GetPeriodicTable(), a.GetSymbol()))
        descriptors.append(ring_info.NumAtomRings(i))
        invariants[i] = hash(tuple(descriptors)) & 0xffffffff
    return invariants

for SMILES in ['Cc1ncccc1', 'CS(=O)(=O)N1CCc2c(C1)c(nn2CCCN1CCOCC1)c1ccc(Cl)c(C#Cc2ccc3C[C@H](NCc3c2)C(=O)N2CCCCC2)c1']:
    mol = Chem.MolFromSmiles(SMILES)
    mol = Chem.AddHs(mol)
    invariants = generateECFPAtomInvariant(mol)
    info, infoi = {}, {}
    # fp: default invariants; fpi: user-defined invariants
    fp = getNumpyArray(AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=8192, invariants=[], bitInfo=info))
    fpi = getNumpyArray(AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=8192, invariants=invariants, bitInfo=infoi))
    print("Do the substructures extracted by default invariants and user-defined invariants match?", set(info.values()) == set(infoi.values()))
    print("Number of mis-matching bits between fp and fpi=", fp.shape[0] - np.count_nonzero(np.equal(fp, fpi)))

To assess whether the fingerprints match, I compared the values of the 'bitInfo' dictionaries. The keys are hash codes, which of course do not match, but the values are pairs of (atomID, radius), which should be the same. As you will see, the (atomID, radius) pairs for the small molecule match *but not for the large one*. I also compared the two bitstrings directly but, I suppose because different (?) hash functions are used, the bits match for neither the small nor the large molecule.

Moreover, when I apply the 'generateECFPAtomInvariant()' function on a large scale, namely to generate fingerprints for training an ML model, the number of invariant bits is 2360 with the default ECFP atom invariants but much lower (795) with the user-defined invariants, and the performance of the ML model is significantly different.

Could someone point out what I am doing wrong?

~Thomas

--
======================================================================
Dr. Thomas Evangelidis
Research Scientist
IOCB - Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>, Prague, Czech Republic
&
CEITEC - Central European Institute of Technology <https://www.ceitec.eu/>, Brno, Czech Republic

email: teva...@gmail.com, Twitter: tevangelidis <https://twitter.com/tevangelidis>, LinkedIn: Thomas Evangelidis <https://www.linkedin.com/in/thomas-evangelidis-495b45125/>
website: https://sites.google.com/site/thomasevangelidishomepage/

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss