Re: [Rdkit-discuss] Folding count vectors
Hi Francois,

I agree with your suggestion. I am also CCing Greg on this response.

I have searched Google for the source code of the CreateDifferenceFingerprintForReaction method, but the most relevant pages I can find that describe what the code does are [here](https://www.rdkit.org/docs/cppapi/structRDKit_1_1ReactionFingerprintParams.html) and [here](https://www.rdkit.org/docs/source/rdkit.Chem.rdChemReactions.html#rdkit.Chem.rdChemReactions.CreateDifferenceFingerprintForReaction).

I don't mind if the source is only in C++, but where can I find it? If I could view the source code, I could understand how folding a count vector is implemented. Right now I am assuming the implementation is similar to folding a bit vector, just applying a SUM instead of a logical OR.

v/r,

Ben

On Wed, Nov 20, 2019 at 3:23 AM Francois Berenger wrote:
> Personally, I wouldn't try to fold a count vector.
> They are sparse vectors, so they don't take a lot of memory.
> Also, they are less information lossy than binary fingerprints.
>
> But, maybe Greg has some hack around, if you are really forced to do
> this.
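The assumption in the reply above (fold like a bit vector, but SUM instead of OR) can be sketched as follows. This is only an illustration under stated assumptions: the helper name fold_counts and the "index modulo size" mapping are hypothetical, and nothing in this thread confirms that RDKit folds count vectors this way. The input is a plain {index: count} dict of the kind GetNonzeroElements() returns.

```python
from collections import defaultdict

def fold_counts(nonzero, n_bits=2048):
    """Fold a sparse count vector {index: count} down to n_bits positions.

    Hypothetical helper: colliding indices are summed, and entries that
    cancel to zero are dropped. This is an assumption about how folding
    could work, not a copy of RDKit's code.
    """
    folded = defaultdict(int)
    for index, count in nonzero.items():
        folded[index % n_bits] += count
    return {i: c for i, c in folded.items() if c != 0}

# Unfolded difference fingerprint reported earlier in this thread.
raw = {558114: -2, 574497: -1, 1066050: 2, 1066081: 1}
folded = fold_counts(raw, n_bits=2048)
print(folded)
```

A fold like this preserves the total of all counts, but it should not be expected to reproduce the indices or values returned by CreateDifferenceFingerprintForReaction, since that function may hash and weight the features differently.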
Re: [Rdkit-discuss] Folding count vectors
On 20/11/2019 02:00, Benjamin Datko wrote:
> Hello Francois,
>
> I am trying to replicate some of the functionality of
> CreateDifferenceFingerprintForReaction [Ref 1] for my own
> understanding on how the code works. The function
> CreateDifferenceFingerprintForReaction allows for three difference
> fingerprint representation of the molecules: AtomPair, Morgan, and
> TopologicalTorsion [Ref 2]. All three are count vectors [Ref 3], and
> the function allows for variable fingerprint size output.

Personally, I wouldn't try to fold a count vector.
They are sparse vectors, so they don't take a lot of memory.
Also, they lose less information than binary fingerprints.

But maybe Greg has a workaround, if you are really forced to do this.
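To make the point about information loss concrete, here is a small sketch with made-up data. The helper to_bits and the two example vectors are hypothetical; plain {index: count} dicts stand in for sparse count fingerprints.

```python
def to_bits(nonzero):
    """Reduce a sparse count vector {index: count} to its set of 'on' indices."""
    return {i for i, c in nonzero.items() if c != 0}

# Two made-up count vectors that differ only in their counts.
fp_a = {5: 3, 9: 1}
fp_b = {5: 1, 9: 2}

# The bit view cannot tell them apart, while the counts still differ.
same_bits = to_bits(fp_a) == to_bits(fp_b)
same_counts = fp_a == fp_b
print(same_bits, same_counts)

# Sparsity of the AtomPair example in this thread: 9 non-zero entries
# out of a nominal length of 8388608, so only 9 pairs need to be stored.
density = 9 / 8388608
print(f"{density:.2e}")
```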
Re: [Rdkit-discuss] Folding count vectors
Hello Francois,

I am trying to replicate some of the functionality of CreateDifferenceFingerprintForReaction [Ref 1] to understand how the code works. The function supports three different fingerprint representations of the molecules: AtomPair, Morgan, and TopologicalTorsion [Ref 2]. All three are count vectors [Ref 3], and the function allows a variable output fingerprint size.

I was following this post [Ref 4], which describes how to create reaction difference fingerprints from different fingerprint representations. Using the code from the post I can create reaction difference fingerprints with either Morgan or AtomPair, but compared with CreateDifferenceFingerprintForReaction the results differ in fingerprint size, in the values within the fingerprint, and in density. I assume this is due to folding the count vector down to the default fingerprint size of 2048.

Example code snippet:

# The below defs are from the post
# https://sourceforge.net/p/rdkit/mailman/message/35240736/

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs
import copy

def _createFP(mol,maxSize,fpType='AP'):
    # Count fingerprint for one reactant or product template
    mol.UpdatePropertyCache(False)
    if fpType == 'AP':
        return AllChem.GetAtomPairFingerprint(mol, minLength=1, maxLength=maxSize)
    else:
        # Ring information has to be set up before the Morgan fingerprint
        Chem.GetSSSR(mol)
        rinfo = mol.GetRingInfo()
        return AllChem.GetMorganFingerprint(mol, radius=maxSize)

def getSumFps(fps):
    # Element-wise sum of a list of count fingerprints
    summedFP = copy.deepcopy(fps[0])
    for fp in fps[1:]:
        summedFP += fp
    return summedFP

def buildReactionFP(rxn, maxSize=3, fpType='AP'):
    # Difference fingerprint: summed products minus summed reactants
    reactants = rxn.GetReactants()
    products = rxn.GetProducts()
    rFP = getSumFps([_createFP(mol,maxSize,fpType=fpType) for mol in reactants])
    pFP = getSumFps([_createFP(mol,maxSize,fpType=fpType) for mol in products])
    return pFP-rFP

>>> rxn1 = AllChem.ReactionFromSmarts('[C:1]C1C1>>[N:1]C1C1', useSmiles=True)
>>> rxfp1 = buildReactionFP(rxn1,maxSize=2)
>>> rxfp1.GetNonzeroElements()
{558114: -2, 574497: -1, 1066050: 2, 1066081: 1}
>>> rxfp1.GetLength()
8388608

>>> # Same reaction now using CreateDifferenceFingerprintForReaction
>>> rxn1_fp = AllChem.CreateDifferenceFingerprintForReaction(rxn1)
>>> rxn1_fp.GetNonzeroElements()
{1048: 10,
 1310: -20,
 1325: 20,
 1372: -10,
 1390: 20,
 1692: -10,
 1757: -20,
 1772: 10}
>>> print(rxn1_fp.GetLength(),rxfp1.GetLength())
2048 8388608

References
1. https://www.rdkit.org/docs/source/rdkit.Chem.rdChemReactions.html#rdkit.Chem.rdChemReactions.CreateDifferenceFingerprintForReaction
2. https://www.rdkit.org/docs/cppapi/structRDKit_1_1ReactionFingerprintParams.html
3. https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints
4. https://sourceforge.net/p/rdkit/mailman/message/35240736/

v/r,

Ben
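The arithmetic behind a difference fingerprint (summed product counts minus summed reactant counts) can also be sketched without RDKit. The names sum_counts and difference_fp are hypothetical, the indices are made up, and plain {index: count} dicts stand in for sparse count fingerprints; this only illustrates the subtraction, not the hashing or any weighting RDKit may apply.

```python
from collections import Counter

def sum_counts(fps):
    """Add several sparse count vectors ({index: count}) element-wise."""
    total = Counter()
    for fp in fps:
        total.update(fp)  # Counter.update adds counts rather than replacing them
    return total

def difference_fp(reactant_fps, product_fps):
    """Products minus reactants, keeping negative counts and dropping zeros."""
    diff = Counter(sum_counts(product_fps))
    diff.subtract(sum_counts(reactant_fps))
    return {i: c for i, c in diff.items() if c != 0}

# Made-up indices standing in for hashed atom-pair features.
reactants = [{101: 2, 202: 1, 303: 1}]
products = [{303: 1, 404: 2, 505: 1}]
result = difference_fp(reactants, products)
print(result)
```

Features present on both sides (index 303 here) cancel out, so only what changes between reactants and products survives, with negative counts for features that are lost and positive counts for features that are gained.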
Re: [Rdkit-discuss] Folding count vectors
On 19/11/2019 03:34, Benjamin Datko wrote:
> Hello all,
>
> I am curious on how to fold a count vector fingerprint. I understand
> when folding bit vectors the most common way is to split the vector in
> half, and apply a bitwise OR operation. I think this is how the
> function rdkit.DataStructs.FoldFingerprint works in RDKit, correct me
> if I am wrong.
>
> How does RDKit and or what is the appropriate way to fold count
> vectors such as AtomPair, Morgan, and Topological torsion?

Can you give us some context? Why do you want to do that?

Maybe you can use the following to create shorter "fingerprints" for which the Tanimoto distance is still computable (although it then becomes approximate):

---
Shrivastava, A. (2016).
Simple and efficient weighted minwise hashing.
In Advances in Neural Information Processing Systems (pp. 1498-1506).
https://papers.nips.cc/paper/6472-simple-and-efficient-weighted-minwise-hashing.pdf
---

Regards,
F.

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
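For reference, the exact quantity that weighted minwise hashing estimates is the weighted Jaccard (min/max) similarity, which for count vectors plays the role of the Tanimoto coefficient. A minimal sketch, assuming non-negative counts and using plain {index: count} dicts; the function name weighted_tanimoto and the convention of returning 1.0 for two empty vectors are choices made for this example only.

```python
def weighted_tanimoto(fp_a, fp_b):
    """Exact min/max (weighted Jaccard) similarity of two sparse count vectors.

    Assumes all counts are non-negative. Two empty vectors are treated as
    identical (similarity 1.0), which is a convention chosen for this sketch.
    """
    keys = set(fp_a) | set(fp_b)
    num = sum(min(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    den = sum(max(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    return num / den if den else 1.0

# Made-up count vectors: they share index 1 with counts 2 and 1.
fp_a = {1: 2, 2: 1}
fp_b = {1: 1, 3: 1}
similarity = weighted_tanimoto(fp_a, fp_b)
print(similarity)  # sum of minima is 1, sum of maxima is 4
```

The corresponding distance is 1 minus this similarity. Computing it exactly only touches the non-zero entries, which is cheap for sparse vectors of this kind.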
[Rdkit-discuss] Folding count vectors
Hello all,

I am curious about how to fold a count vector fingerprint. I understand that the most common way to fold a bit vector is to split it in half and apply a bitwise OR operation. I think this is how rdkit.DataStructs.FoldFingerprint works in RDKit; correct me if I am wrong.

How does RDKit fold count vectors such as AtomPair, Morgan, and TopologicalTorsion, and/or what is the appropriate way to do it?

I thought about turning the fingerprint into a bit vector using its respective "AsBitVect" method and then folding with rdkit.DataStructs.FoldFingerprint, but topological torsion doesn't have an "AsBitVect" method [https://www.rdkit.org/docs/GettingStartedInPython.html].

For an explicit example using the AtomPair fingerprint, we can see that the fingerprint is extremely sparse. Could this AtomPair fingerprint be folded to increase the density?

>>> from rdkit import Chem
>>> from rdkit.Chem import AllChem
>>> mol = Chem.MolFromSmiles('CC1C1')
>>> ap_fp = AllChem.GetAtomPairFingerprint(mol, minLength=1, maxLength=3)
>>> number_of_nonzero_elements = len(ap_fp.GetNonzeroElements().values())
>>> print((ap_fp.GetLength(),number_of_nonzero_elements))
(8388608,9)

Very Respectfully,

Ben
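The "split in half and OR" folding of a bit vector described above can be sketched in plain Python. The helper fold_bits is hypothetical and represents the bit vector as a set of on-bit indices; whether rdkit.DataStructs.FoldFingerprint works exactly like this is the open question of this message, so treat it as an illustration rather than a description of RDKit.

```python
def fold_bits(on_bits, n_bits, fold_factor=2):
    """Fold a bit vector, given as a set of on-bit indices, by OR-ing segments.

    Splitting the vector into fold_factor equal segments and OR-ing them is
    the same as mapping bit i to i % (n_bits // fold_factor).
    """
    if n_bits % fold_factor != 0:
        raise ValueError("n_bits must be divisible by fold_factor")
    new_size = n_bits // fold_factor
    return {i % new_size for i in on_bits}, new_size

# Made-up 16-bit fingerprint with four bits set.
on_bits = {1, 5, 9, 13}
folded, size = fold_bits(on_bits, n_bits=16)
print(sorted(folded), size)
```

Bits 9 and 13 land on positions 1 and 5, which are already set, so the folded vector has only two on bits. With OR, such collisions are silently merged; with counts, the colliding values would have to be combined in some way, which is what the question is about.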