Re: [Rdkit-discuss] Folding count vectors

2019-12-02 Thread Benjamin Datko
Hi Francois,

I agree with your suggestion. I am also CCing Greg on this response.

I have searched Google for the source code of the
CreateDifferenceFingerprintForReaction method, but the most relevant pages I
can find describing what the code does are
[here](https://www.rdkit.org/docs/cppapi/structRDKit_1_1ReactionFingerprintParams.html)
and
[here](https://www.rdkit.org/docs/source/rdkit.Chem.rdChemReactions.html#rdkit.Chem.rdChemReactions.CreateDifferenceFingerprintForReaction).

I don't mind if the source is only in C++, but where can I find it? If I could
view the source code, I could understand how folding a count vector is
implemented. For now I am assuming the implementation is similar to folding
a bit vector, just applying a SUM instead of a logical OR.
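
A minimal sketch of that assumption, written in plain Python on the
index-to-count dictionary that GetNonzeroElements() returns. The helper name
fold_counts is hypothetical and not part of RDKit, and nothing here is
confirmed to match the C++ implementation. Repeatedly halving a vector and
summing the halves is equivalent to taking each index modulo the target
size, which is what the sketch does:

```python
from collections import defaultdict


def fold_counts(nonzero, n_bins):
    """Fold a sparse count vector into n_bins bins by summing collisions.

    nonzero: dict mapping index -> count (e.g. from GetNonzeroElements()).
    Hypothetical helper for illustration; not part of RDKit.
    """
    folded = defaultdict(int)
    for index, count in nonzero.items():
        # Indices that differ by a multiple of n_bins land in the same bin.
        folded[index % n_bins] += count
    # Drop bins whose counts cancelled out to zero.
    return {i: c for i, c in sorted(folded.items()) if c != 0}


# Unfolded AtomPair difference fingerprint reported in this thread.
unfolded = {558114: -2, 574497: -1, 1066050: 2, 1066081: 1}
print(fold_counts(unfolded, 2048))
```

The folded indices and values this produces differ from those returned by
CreateDifferenceFingerprintForReaction for the same reaction, so the sketch
should not be read as a description of what that function does.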

v/r,

Ben


Re: [Rdkit-discuss] Folding count vectors

2019-11-20 Thread Francois Berenger

On 20/11/2019 02:00, Benjamin Datko wrote:

> Hello Francois,
>
> I am trying to replicate some of the functionality of
> CreateDifferenceFingerprintForReaction [Ref 1] for my own
> understanding of how the code works. The function
> CreateDifferenceFingerprintForReaction allows for three different
> fingerprint representations of the molecules: AtomPair, Morgan, and
> TopologicalTorsion [Ref 2]. All three are count vectors [Ref 3], and
> the function allows for variable fingerprint size output.

Personally, I wouldn't try to fold a count vector.
They are sparse vectors, so they don't take a lot of memory.
Also, they lose less information than binary fingerprints.

But maybe Greg has some workaround, if you are really forced to do
this.




Re: [Rdkit-discuss] Folding count vectors

2019-11-19 Thread Benjamin Datko
Hello Francois,

I am trying to replicate some of the functionality of
CreateDifferenceFingerprintForReaction [Ref 1] for my own understanding of
how the code works. The function CreateDifferenceFingerprintForReaction
allows for three different fingerprint representations of the molecules:
AtomPair, Morgan, and TopologicalTorsion [Ref 2]. All three are count
vectors [Ref 3], and the function allows for variable fingerprint size
output.

I was following this post [Ref 4] describing how to create reaction
difference fingerprints using different fingerprint representations. Using
the code from the post I can create reaction difference fingerprints using
either Morgan or AtomPair, but comparing the output from the post [Ref 4]
to that of CreateDifferenceFingerprintForReaction shows fingerprints of
different sizes, with different values and different densities. I am
assuming this is due to folding the count vector down to the default
fingerprint size of 2048.

Example code snippet:

# The defs below are from the post
# https://sourceforge.net/p/rdkit/mailman/message/35240736/
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs
import copy

def _createFP(mol, maxSize, fpType='AP'):
    # Non-strict property update, since reaction templates are not sanitized
    mol.UpdatePropertyCache(False)
    if fpType == 'AP':
        return AllChem.GetAtomPairFingerprint(mol, minLength=1,
                                              maxLength=maxSize)
    else:
        # Ring information has to be set up before computing Morgan fingerprints
        Chem.GetSSSR(mol)
        rinfo = mol.GetRingInfo()
        return AllChem.GetMorganFingerprint(mol, radius=maxSize)

def getSumFps(fps):
    # Sum a list of sparse count vectors without modifying the inputs
    summedFP = copy.deepcopy(fps[0])
    for fp in fps[1:]:
        summedFP += fp
    return summedFP

def buildReactionFP(rxn, maxSize=3, fpType='AP'):
    reactants = rxn.GetReactants()
    products = rxn.GetProducts()
    rFP = getSumFps([_createFP(mol, maxSize, fpType=fpType) for mol in reactants])
    pFP = getSumFps([_createFP(mol, maxSize, fpType=fpType) for mol in products])
    # Difference fingerprint: summed products minus summed reactants
    return pFP - rFP

>>> rxn1 = AllChem.ReactionFromSmarts('[C:1]C1C1>>[N:1]C1C1', useSmiles=True)
>>> rxfp1 = buildReactionFP(rxn1,maxSize=2)

>>> rxfp1.GetNonzeroElements()
{558114: -2, 574497: -1, 1066050: 2, 1066081: 1}

>>> rxfp1.GetLength()
8388608


# Same reaction now using CreateDifferenceFingerprintForReaction
>>> rxn1_fp = AllChem.CreateDifferenceFingerprintForReaction(rxn1)

>>> rxn1_fp.GetNonzeroElements()
{1048: 10,
 1310: -20,
 1325: 20,
 1372: -10,
 1390: 20,
 1692: -10,
 1757: -20,
 1772: 10}

>>> print(rxn1_fp.GetLength(),rxfp1.GetLength())
2048 8388608

References
1. https://www.rdkit.org/docs/source/rdkit.Chem.rdChemReactions.html#rdkit.Chem.rdChemReactions.CreateDifferenceFingerprintForReaction
2. https://www.rdkit.org/docs/cppapi/structRDKit_1_1ReactionFingerprintParams.html
3. https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints
4. https://sourceforge.net/p/rdkit/mailman/message/35240736/

v/r,

Ben


Re: [Rdkit-discuss] Folding count vectors

2019-11-18 Thread Francois Berenger

On 19/11/2019 03:34, Benjamin Datko wrote:

> Hello all,
>
> I am curious about how to fold a count vector fingerprint. I understand
> that when folding bit vectors the most common way is to split the vector
> in half and apply a bitwise OR operation. I think this is how the
> function rdkit.DataStructs.FoldFingerprint works in RDKit; correct me
> if I am wrong.
>
> How does RDKit fold count vectors such as AtomPair, Morgan, and
> Topological torsion, and/or what is the appropriate way to do so?


Can you give us some context? Why do you want to do that?

Maybe you can use the following to create
shorter "fingerprints" for which the Tanimoto distance is
still computable (although only approximately):

---
Shrivastava, A. (2016).
Simple and efficient weighted minwise hashing.
In Advances in Neural Information Processing Systems (pp. 1498-1506).

https://papers.nips.cc/paper/6472-simple-and-efficient-weighted-minwise-hashing.pdf
---
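
For context, the quantity that weighted minwise hashing estimates is the
weighted Jaccard (min/max Tanimoto) similarity between non-negative count
vectors. Below is a minimal exact version on sparse index-to-count
dictionaries. It assumes non-negative counts, so it does not apply directly
to difference fingerprints with negative entries. The function name is
hypothetical, and this is the exact similarity, not the hashing scheme from
the paper:

```python
def weighted_jaccard(a, b):
    """Exact weighted Jaccard similarity of two sparse count vectors.

    a, b: dicts mapping index -> non-negative count.
    Hypothetical helper for illustration; not part of RDKit.
    """
    keys = set(a) | set(b)
    # Sum of element-wise minima over sum of element-wise maxima.
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    # Two empty vectors are treated as identical.
    return numerator / denominator if denominator else 1.0


fp_a = {1: 2, 2: 1}
fp_b = {1: 1, 3: 1}
similarity = weighted_jaccard(fp_a, fp_b)
print(similarity, 1.0 - similarity)  # similarity and Tanimoto distance
```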

Regards,
F.





___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Folding count vectors

2019-11-18 Thread Benjamin Datko
Hello all,

I am curious about how to fold a count vector fingerprint. I understand that
when folding bit vectors the most common way is to split the vector in half
and apply a bitwise OR operation. I think this is how the function
rdkit.DataStructs.FoldFingerprint works in RDKit; correct me if I am wrong.
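
The split-and-OR idea described above can be sketched in plain Python as
follows. This only illustrates the idea on a list of 0/1 values. The helper
fold_bits is hypothetical, and whether rdkit.DataStructs.FoldFingerprint
does exactly this is an assumption, not something confirmed here:

```python
def fold_bits(bits, fold_factor=2):
    """OR together fold_factor equal-length chunks of a bit list.

    Hypothetical helper for illustration; not an RDKit function.
    """
    if len(bits) % fold_factor != 0:
        raise ValueError("length must be divisible by fold_factor")
    n = len(bits) // fold_factor
    folded = [0] * n
    for i, bit in enumerate(bits):
        # Bit i of every chunk is OR-ed into position i of the result.
        folded[i % n] |= bit
    return folded


print(fold_bits([1, 0, 0, 0, 0, 1, 0, 1]))
```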

How does RDKit fold count vectors such as AtomPair, Morgan, and Topological
torsion, and/or what is the appropriate way to do so?

I thought about turning the fingerprint into a bit vector using their
respective "AsBitVect" method and then folding using
rdkit.DataStructs.FoldFingerprint, but topological torsion doesn't have an
"AsBitVect" method [https://www.rdkit.org/docs/GettingStartedInPython.html].

For an explicit example using AtomPair fingerprint we can see the
fingerprint is extremely sparse. Could this AtomPair fingerprint be folded
to increase the density?

>>> from rdkit import Chem
>>> from rdkit.Chem import AllChem

>>> mol = Chem.MolFromSmiles('CC1C1')
>>> ap_fp = AllChem.GetAtomPairFingerprint(mol, minLength=1, maxLength=3)

>>> number_of_nonzero_elements = len(ap_fp.GetNonzeroElements().values())

>>> print((ap_fp.GetLength(),number_of_nonzero_elements))
(8388608,9)

Very Respectfully,

Ben