Re: [Rdkit-discuss] Request for Assistance with MACCS 166 Fingerprint Calculation for 3D QSAR Study

2024-05-02 Thread Ariadna Llop Peiró
Hello everyone,

Thank you for all your helpful suggestions.

I've taken careful note of them, and they have been extremely helpful in
guiding my work.
3D-QSAR is also new to me, and your insights and expertise have been
incredibly valuable.

Thank you once again for your generous assistance.

Best Regards,

Ariadna Llop

Message from Andrew Dalke on Tue, 30 Apr 2024 at 22:45:

> Hi Ariadna,
>
>   In general the MACCS keys are not that good for comparing similarity.
> They still exist for historical reasons. Back in the 1970s the company
> Molecular Design Limited developed a program called "Molecular Access
> System" (MACCS) for structure registration, substructure search, and the
> like.
>
> Substructure search is slow, so MACCS includes a set of keys which would
> act as fast filters - if the query contained a key but the database entry
> did not, then the query could not match that entry.
>
> In the 1980s when fingerprint similarity search first became popular -
> this is before the term "fingerprint" was even coined - people used the
> MACCS keys because they were already computed and sitting there, on the
> computer system they were already using.
>
> Over time people developed other types of fingerprints, and different ways
> to compare them, and a more complete understanding of how they are coupled
> to the types of system being studied.
>
> For example, in "Comparing structural fingerprints using a
> literature-based similarity benchmark" by Sayle and O'Boyle,
> "Extended-connectivity fingerprints of diameter 4 and 6 are among the best
> performing fingerprints when ranking diverse structures by similarity, as
> is the topological torsion fingerprint. However, when ranking very close
> analogues, the atom pair fingerprint outperforms the others tested."
>
> They found the MACCS fingerprints to be one of the worst performers, which
> you might expect now that you know the happenstance which made them popular.
>
> Since you are doing 3D QSAR, you should familiarize yourself with the
> fingerprints used in that area. I have no experience with 3D QSAR and
> cannot provide advice on what is appropriate.
>
> The first paper I found using Google Scholar to search for "3d qsar
> fingerprints" is "Docking, Interaction Fingerprint, and Three-Dimensional
> Quantitative Structure–Activity Relationship (3D-QSAR) of Sigma1 Receptor
> Ligands, Analogs of the Neuroprotective Agent RC-33" at
> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6637851/ which uses
> Interaction fingerprints.
>
> The second is "Novel TOPP descriptors in 3D-QSAR analysis of apoptosis
> inducing 4-aryl-4H-chromenes: Comparison versus other 2D- and
> 3D-descriptors" at
> https://www.sciencedirect.com/science/article/pii/S0968089607005834 which
> I mention because it summarizes 7 different descriptor-based approaches,
> and places the MACCS keys in last place, far below the second worst
> ("TOPP > GRIND > BCI 4096 = ECFP > FCFP > GRID-GOLPE ≫ DRAGON ⋙ MDL 166").
>
> No doubt there are many others for you to read through and try out.
>
>
> > # Generate fingerprint descriptor database
> > fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols]
>
> What I can suggest is you try my chemfp package, specifically version
> 4.2b1, which I just released (bear in mind that it is a beta!)
>
> You can install it with:
>
>    python -m pip install chemfp==4.2b1 -i https://chemfp.com/packages/
>
> To generate Morgan fingerprints of radius 2, I suggest you compute them
> once and store them in a file, like this command-line example:
>
>   rdkit2fps --morgan2 dataset.smi -o dataset.fps
>
> (use "--maccs" to generate MACCS keys, "--pair" for atom pairs; and use
> "--help" to see what other options are available.)
>
> To "Calculate pairwise Tanimoto similarity between fingerprints" as a
> distance, you can use another command-line tool to generate the matrix as a
> NumPy "npy" file, like this:
>
>   chemfp simarray dataset.fps --as-distance -o dataset.npy
>
> To load this in Python:
>
>   import numpy as np
>   dists = np.load("dataset.npy")
>
> If you also need the identifiers:
>
>   with open("dataset.npy", "rb") as f:
>       dists = np.load(f)
>       metadata = np.load(f)
>       ids = np.load(f)
>
> This should make it easier to iterate over the different clustering
> methods available, since you only generate the fingerprints and distance
> matrix once.
>
> If you decide to use interaction fingerprints, or some other fingerprint
> type that is not in the RDKit, you can still generate the fingerprints in
> FPS format (a simple text format) and use chemfp to generate your matrix
> for you, either on the command-line or through its Python API.
>
> > However, I'm not satisfied with the results and would like to experiment
> > with MACCS Keys to see if they yield better clustering outcomes. Does
> > anyone know how to cluster compounds using MACCS fingerprints? Any insights
> > on the best approach to calculate similarities and cluster using these
>

Re: [Rdkit-discuss] Request for Assistance with MACCS 166 Fingerprint Calculation for 3D QSAR Study

2024-04-30 Thread Andrew Dalke
Hi Ariadna,

  In general the MACCS keys are not that good for comparing similarity. They
still exist for historical reasons. Back in the 1970s the company Molecular
Design Limited developed a program called "Molecular Access System" (MACCS) for 
structure registration, substructure search, and the like.

Substructure search is slow, so MACCS includes a set of keys which would act as 
fast filters - if the query contained a key but the database entry did not, 
then the query could not match that entry.
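
The screening idea described above can be sketched in a few lines of Python. This is only an illustration of the logic, not how MACCS itself was implemented: fingerprints are represented here as plain sets of key numbers, and the names may_match, screen, database, and query are hypothetical.

```python
# Illustrative only: each fingerprint is a set of the key numbers that are "on".

def may_match(query_keys, entry_keys):
    """Return False when the entry can be rejected without a substructure search."""
    # Every key present in the query must also be present in the entry.
    return query_keys <= entry_keys


def screen(query_keys, database):
    """Return the ids of the entries that survive the key screen."""
    return [entry_id for entry_id, keys in database.items()
            if may_match(query_keys, keys)]


database = {
    "entry-1": {3, 17, 42, 160},
    "entry-2": {17, 99},
    "entry-3": {3, 17, 42},
}
query = {17, 42}

candidates = screen(query, database)
print(candidates)  # ['entry-1', 'entry-3']
```

Only the surviving candidates would then need the slow, exact substructure comparison. The screen can produce false positives but never false negatives.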

In the 1980s when fingerprint similarity search first became popular - this is 
before the term "fingerprint" was even coined - people used the MACCS keys 
because they were already computed and sitting there, on the computer system 
they were already using.

Over time people developed other types of fingerprints, and different ways to 
compare them, and a more complete understanding of how they are coupled to the 
types of system being studied.

For example, in "Comparing structural fingerprints using a literature-based 
similarity benchmark" by Sayle and O'Boyle, "Extended-connectivity fingerprints 
of diameter 4 and 6 are among the best performing fingerprints when ranking 
diverse structures by similarity, as is the topological torsion fingerprint. 
However, when ranking very close analogues, the atom pair fingerprint 
outperforms the others tested."

They found the MACCS fingerprints to be one of the worst performers, which you 
might expect now that you know the happenstance which made them popular.

Since you are doing 3D QSAR, you should familiarize yourself with the 
fingerprints used in that area. I have no experience with 3D QSAR and cannot 
provide advice on what is appropriate. 

The first paper I found using Google Scholar to search for "3d qsar 
fingerprints" is "Docking, Interaction Fingerprint, and Three-Dimensional 
Quantitative Structure–Activity Relationship (3D-QSAR) of Sigma1 Receptor 
Ligands, Analogs of the Neuroprotective Agent RC-33" at 
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6637851/ which uses Interaction 
fingerprints.

The second is "Novel TOPP descriptors in 3D-QSAR analysis of apoptosis inducing 
4-aryl-4H-chromenes: Comparison versus other 2D- and 3D-descriptors" at 
https://www.sciencedirect.com/science/article/pii/S0968089607005834 which I 
mention because it summarizes 7 different descriptor-based approaches, and
places the MACCS keys in last place, far below the second worst
("TOPP > GRIND > BCI 4096 = ECFP > FCFP > GRID-GOLPE ≫ DRAGON ⋙ MDL 166").

No doubt there are many others for you to read through and try out.


> # Generate fingerprint descriptor database
> fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols]

What I can suggest is you try my chemfp package, specifically version 4.2b1,
which I just released (bear in mind that it is a beta!)

You can install it with:

   python -m pip install chemfp==4.2b1 -i https://chemfp.com/packages/

To generate Morgan fingerprints of radius 2, I suggest you compute them once 
and store them in a file, like this command-line example:

  rdkit2fps --morgan2 dataset.smi -o dataset.fps

(use "--maccs" to generate MACCS keys, "--pair" for atom pairs; and use 
"--help" to see what other options are available.)

To "Calculate pairwise Tanimoto similarity between fingerprints" as a distance, 
you can use another command-line tool to generate the matrix as a NumPy "npy" 
file, like this:

  chemfp simarray dataset.fps --as-distance -o dataset.npy
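
For readers who want to see what such a distance matrix contains, here is a minimal pure-Python sketch of Tanimoto distance (1 minus Tanimoto similarity). It is not chemfp's implementation. Fingerprints are modelled as sets of on-bit indices, the function names are made up for illustration, and treating two empty fingerprints as having similarity 0 is an assumption.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as dissimilar
    return len(a & b) / union


def distance_matrix(fps):
    """Full symmetric matrix of Tanimoto distances, with zeros on the diagonal."""
    n = len(fps)
    dists = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - tanimoto(fps[i], fps[j])
            dists[i][j] = dists[j][i] = d  # the matrix is symmetric
    return dists


example_fps = [{1, 2, 3, 4}, {1, 2, 3}, {7, 8}]
example_dists = distance_matrix(example_fps)
for row in example_dists:
    print(row)
```

The first two fingerprints share 3 of 4 distinct bits, so their distance is 0.25. The third shares nothing with either, so its distance to both is 1.0.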

To load this in Python:

  import numpy as np
  dists = np.load("dataset.npy")

If you also need the identifiers:

  with open("dataset.npy", "rb") as f:
      dists = np.load(f)
      metadata = np.load(f)
      ids = np.load(f)

This should make it easier to iterate over the different clustering methods 
available, since you only generate the fingerprints and distance matrix once.
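
As a rough sketch of that workflow, the loop below tries several SciPy linkage methods on one precomputed matrix. It assumes the saved array is a full symmetric N×N distance matrix with a zero diagonal. A small hand-made matrix stands in for np.load("dataset.npy") so the example is self-contained, and the 0.6 threshold is taken from the original question.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

# Stand-in for: dists = np.load("dataset.npy")
dists = np.array([
    [0.00, 0.20, 0.90, 0.80],
    [0.20, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.10],
    [0.80, 0.90, 0.10, 0.00],
])

# linkage() expects the condensed (upper-triangle) form of the matrix.
condensed = squareform(dists)

threshold = 0.6
results = {}
for method in ("single", "complete", "average"):
    tree = hierarchy.linkage(condensed, method=method)
    # Cut the dendrogram at the chosen distance threshold.
    results[method] = hierarchy.fcluster(tree, threshold, criterion="distance")
    print(method, results[method])
```

With this toy matrix all three methods put the first two compounds in one cluster and the last two in another. On real data the methods can disagree, which is the reason to compare them.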

If you decide to use interaction fingerprints, or some other fingerprint type 
that is not in the RDKit, you can still generate the fingerprints in FPS format 
(a simple text format) and use chemfp to generate your matrix for you, either 
on the command-line or through its Python API.
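
A minimal sketch of writing such a file by hand follows. It assumes the usual FPS layout: a "#FPS1" first line, optional "#key=value" header lines, then one record per line made of the hex-encoded fingerprint, a tab, and the identifier, with bit 0 stored as the least significant bit of the first byte. Please check those details against the chemfp documentation before relying on them. The helper names to_fps_line and write_fps are hypothetical.

```python
import io


def to_fps_line(on_bits, num_bits, identifier):
    """Encode a set of on-bit indices as one FPS record (hex, tab, id)."""
    data = bytearray((num_bits + 7) // 8)
    for bit in on_bits:
        if not 0 <= bit < num_bits:
            raise ValueError(f"bit {bit} is outside 0..{num_bits - 1}")
        # Assumed bit order: bit 0 is the least significant bit of byte 0.
        data[bit // 8] |= 1 << (bit % 8)
    return f"{data.hex()}\t{identifier}"


def write_fps(records, num_bits, out):
    """Write (identifier, on_bits) pairs to the text file object `out`."""
    out.write("#FPS1\n")
    out.write(f"#num_bits={num_bits}\n")
    for identifier, on_bits in records:
        out.write(to_fps_line(on_bits, num_bits, identifier) + "\n")


buffer = io.StringIO()
write_fps([("mol1", {0, 9}), ("mol2", {15})], num_bits=16, out=buffer)
fps_text = buffer.getvalue()
print(fps_text, end="")
```

In practice `out` would be a file opened with open("dataset.fps", "w"), and the on-bit sets would come from whatever interaction-fingerprint tool you use.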

> However, I'm not satisfied with the results and would like to experiment with 
> MACCS Keys to see if they yield better clustering outcomes. Does anyone know 
> how to cluster compounds using MACCS fingerprints? Any insights on the best 
> approach to calculate similarities and cluster using these fingerprints would 
> be highly appreciated.

In case I was not clear enough before, MACCS keys make poor fingerprints. There
is no reason to expect they will yield better clustering outcomes, and multiple
papers suggest they will produce worse ones.

Best regards,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net

Re: [Rdkit-discuss] Request for Assistance with MACCS 166 Fingerprint Calculation for 3D QSAR Study

2024-04-23 Thread Greg Landrum
Hi,

Please do not duplicate questions/posts between the mailing list and github
discussions. That's spamming the community.

-greg


On Tue, Apr 23, 2024 at 4:10 PM Ariadna Llop Peiró 
wrote:

> Hello everyone,
>
> I'm currently working with a dataset of chemical compounds, aiming to
> cluster them into different series to create a 3D-QSAR model. Up to this
> point, I've been using Morgan Fingerprints to generate the descriptors and
> cluster the compounds based on their Tanimoto Similarity:
>
> ```
> # Generate fingerprint descriptor database
> fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols]
>
>
> # Calculate pairwise Tanimoto similarity between fingerprints
> similarity_matrix = []
> for i in range(len(fps)):
>     similarities = []
>     for j in range(len(fps)):
>         similarities.append(DataStructs.TanimotoSimilarity(fps[i], fps[j]))
>
>     similarity_matrix.append(similarities)
> ```
>
>
> With the similarity matrix, I applied hierarchical clustering based on a
> Tanimoto Similarity threshold to group similar compounds:
>
> ```
> # Cluster based on Tanimoto similarity
> dists = 1 - np.array(similarity_matrix)
> hc = hierarchy.linkage(squareform(dists), method='single')
>
> # Specify a distance threshold or number of clusters
> # Adjust this value based on your dendrogram and similarity values
> threshold = 0.6
> clusters = hierarchy.fcluster(hc, threshold, criterion='distance')
> ```
>
> However, I'm not satisfied with the results and would like to experiment
> with MACCS Keys to see if they yield better clustering outcomes. Does
> anyone know how to cluster compounds using MACCS fingerprints? Any insights
> on the best approach to calculate similarities and cluster using these
> fingerprints would be highly appreciated.
>
> Thank you in advance for your suggestions!
>
> Ariadna Llop
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>


[Rdkit-discuss] Request for Assistance with MACCS 166 Fingerprint Calculation for 3D QSAR Study

2024-04-23 Thread Ariadna Llop Peiró
Hello everyone,

I'm currently working with a dataset of chemical compounds, aiming to
cluster them into different series to create a 3D-QSAR model. Up to this
point, I've been using Morgan Fingerprints to generate the descriptors and
cluster the compounds based on their Tanimoto Similarity:

```
from rdkit import DataStructs
from rdkit.Chem import AllChem

# Generate fingerprint descriptor database
# (mols is a list of RDKit molecules prepared earlier)
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols]


# Calculate pairwise Tanimoto similarity between fingerprints
similarity_matrix = []
for i in range(len(fps)):
    similarities = []
    for j in range(len(fps)):
        similarities.append(DataStructs.TanimotoSimilarity(fps[i], fps[j]))

    similarity_matrix.append(similarities)
```


With the similarity matrix, I applied hierarchical clustering based on a
Tanimoto Similarity threshold to group similar compounds:

```
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

# Cluster based on Tanimoto similarity
dists = 1 - np.array(similarity_matrix)
hc = hierarchy.linkage(squareform(dists), method='single')

# Specify a distance threshold or number of clusters
# Adjust this value based on your dendrogram and similarity values
threshold = 0.6
clusters = hierarchy.fcluster(hc, threshold, criterion='distance')
```

However, I'm not satisfied with the results and would like to experiment
with MACCS Keys to see if they yield better clustering outcomes. Does
anyone know how to cluster compounds using MACCS fingerprints? Any insights
on the best approach to calculate similarities and cluster using these
fingerprints would be highly appreciated.

Thank you in advance for your suggestions!

Ariadna Llop