[Rdkit-discuss] Question: Explaining bits from Morgan Fingerprints

2021-07-21 Thread Bilal Nizami
Dear RDKit community,

I am trying to follow the "Explaining bits from Morgan Fingerprints" section
of the RDKit Getting Started guide (
http://www.rdkit.org/docs/GettingStartedInPython.html#explaining-bits-from-morgan-fingerprints
).

I want to get the SMILES for each bit of the Morgan fingerprint of this molecule.

[image: image.png]
I run this Python snippet:

m = Chem.MolFromSmiles('OC(=O)C1=CC(O)=C(O)C(O)=C1')
info = {}
atoms = set()
for key in Chem.GetMorganFingerprint(m, radius,
                                     bitInfo=info).GetNonzeroElements():
    print(key)
    print('fp bit: ', info[key], ' and length is: ', len(info[key]))
    env = Chem.FindAtomEnvironmentOfRadiusN(m, info[key][0][1],
                                            info[key][0][0])
    print('Will use the atom: ', info[key][0][0], ' and radius; ',
          info[key][0][1])
    display(Draw.DrawMorganBit(m, key, info, useSVG=True))
    for bidx in env:
        atoms.add(m.GetBondWithIdx(bidx).GetBeginAtomIdx())
        atoms.add(m.GetBondWithIdx(bidx).GetEndAtomIdx())
    print('atom To Use: ', list(atoms), 'and rooted at atom: ',
          info[key][0][0])
    smiles = Chem.MolFragmentToSmiles(m, atomsToUse=list(atoms),
                                      bondsToUse=env)
    print('Smiles is: ', smiles)

For certain Morgan bits, such as 994485099, I get the SMILES
c.c.c.O.O.O.ccc, which looks odd because it contains many disconnected
fragments (please see the screen capture below), even though
Draw.DrawMorganBit produces the correct image for the same bit.

[image: image.png]

Any suggestions as to where I might be making a mistake?

Thanks in advance.

Bilal
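
A possible explanation, offered as a sketch rather than a confirmed diagnosis:
in the snippet above the atoms set is created once, before the loop, so atoms
collected for earlier bits are still in it when a later bit is processed,
while bondsToUse only holds the bonds of the current environment. Atoms that
arrive without any of their bonds would then be written as dot-separated
fragments. The helper below builds a fresh atom set for every environment.
environment_smiles is a hypothetical name chosen for illustration (it is not
an RDKit function); atom_idx and radius correspond to info[key][0][0] and
info[key][0][1] in the snippet above.

```python
from rdkit import Chem


def environment_smiles(mol, atom_idx, radius):
    """Return a fragment SMILES for one (atom, radius) environment.

    Hypothetical helper for illustration; not part of RDKit.
    """
    if radius == 0:
        # A radius-0 environment is just the central atom.
        return Chem.MolFragmentToSmiles(mol, atomsToUse=[atom_idx])
    # Indices of the bonds within `radius` bonds of the central atom.
    bonds = list(Chem.FindAtomEnvironmentOfRadiusN(mol, radius, atom_idx))
    if not bonds:
        # No environment of that size exists; fall back to the atom itself.
        return Chem.MolFragmentToSmiles(mol, atomsToUse=[atom_idx])
    # Fresh set for every environment, so nothing leaks in from other bits.
    atoms = {atom_idx}
    for bidx in bonds:
        bond = mol.GetBondWithIdx(bidx)
        atoms.add(bond.GetBeginAtomIdx())
        atoms.add(bond.GetEndAtomIdx())
    return Chem.MolFragmentToSmiles(
        mol, atomsToUse=sorted(atoms), bondsToUse=bonds, rootedAtAtom=atom_idx
    )


m = Chem.MolFromSmiles('OC(=O)C1=CC(O)=C(O)C(O)=C1')
for atom_idx in range(m.GetNumAtoms()):
    for radius in range(3):
        print(atom_idx, radius, environment_smiles(m, atom_idx, radius))
```

In the original loop the equivalent change would be to move atoms = set()
inside the for loop and to add the central atom explicitly, so that radius-0
bits (which have no bonds) still have an atom to write.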
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] [ext] Re: Taylor-Butina clustering

2021-07-21 Thread Volkamer, Andrea
Hi Francesca,


Adding to David's comment: we have some material for beginners that covers
and applies Butina clustering, which may be useful:
https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T005_compound_clustering/talktorial.ipynb


Best, Andrea




Prof. Dr. Andrea Volkamer

In silico Toxicology and Structural Bioinformatics,
Institute of Physiology, Charité Universitätsmedizin Berlin

Campus Mitte: Virchowweg 6, 10117 Berlin
Phone: +49 30 - 450 528 504
E-Mail: andrea.volka...@charite.de





Re: [Rdkit-discuss] Taylor-Butina clustering

2021-07-21 Thread David Cosgrove
Hi Francesca,

The Taylor-Butina clustering is not hierarchical.  It is a type of sphere
exclusion algorithm.  A useful image for the results would be the
"centroid" of each cluster, possibly followed by the other cluster
members.  You will need to generate the images from the original input
molecules, not the fingerprints.   You'll need to write some extra code to
read the clusters and do this.  The Getting Started document (
https://www.rdkit.org/docs/GettingStartedInPython.html) should help you
with the image generation.  Technically, the centroids aren't proper
centroids; they are the molecules that each cluster is based on.  The true
centroid would be some sort of average of the fingerprints of the molecules
in the cluster, which itself would not be a molecule.  Dealing with false
singletons is a matter of taste, as they are an artifact of the
clustering method.  One approach I have had success with in the past is to
define a second, looser, similarity threshold and put each false singleton
into the cluster whose centroid it is most similar to, so long as it is
within this new threshold.  False singletons are certainly more common than
true ones in my experience.
The threshold you use for the clustering should be chosen with some care,
and will depend on the fingerprint type more than anything else.  Greg
recently wrote a blog post (
https://greglandrum.github.io/rdkit-blog/similarity/reference/2021/05/26/similarity-threshold-observations1.html)
on selecting a threshold for similarity searching, and those suggestions
are probably a good starting point for this, too.
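
The sphere-exclusion idea and the second, looser threshold for false
singletons can be sketched in plain Python. This is a simplified
illustration under stated assumptions, not RDKit's implementation (RDKit
ships its own version as rdkit.ML.Cluster.Butina.ClusterData). Fingerprints
are represented as sets of on-bit indices, thresholds are similarities
rather than distances, and the names butina_like and
rescue_false_singletons, the toy fingerprints, and the 0.55 and 0.3
thresholds are all made up for the example. Every pairwise similarity is
computed, so it is only practical for small sets.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def butina_like(fps, threshold):
    """Sphere-exclusion clustering on a similarity threshold.

    Returns (clusters, neighbours). Each cluster is a list of indices whose
    first entry is the molecule the cluster is built around (the "centroid").
    """
    n = len(fps)
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fps[i], fps[j]) >= threshold:
                neighbours[i].add(j)
                neighbours[j].add(i)
    # Visit molecules from most to fewest neighbours.
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    assigned = set()
    clusters = []
    for centre in order:
        if centre in assigned:
            continue
        # The new cluster takes every neighbour that is still unassigned.
        members = sorted(neighbours[centre] - assigned)
        clusters.append([centre] + members)
        assigned.add(centre)
        assigned.update(members)
    return clusters, neighbours


def rescue_false_singletons(fps, clusters, neighbours, loose_threshold):
    """Move false singletons into the most similar multi-member cluster."""
    kept = [list(c) for c in clusters if len(c) > 1]
    singles = []
    for cluster in clusters:
        if len(cluster) > 1:
            continue
        idx = cluster[0]
        # A true singleton never had a neighbour at the clustering threshold.
        if not neighbours[idx] or not kept:
            singles.append([idx])
            continue
        best = max(kept, key=lambda c: tanimoto(fps[idx], fps[c[0]]))
        if tanimoto(fps[idx], fps[best[0]]) >= loose_threshold:
            best.append(idx)
        else:
            singles.append([idx])
    return kept + singles


# Toy fingerprints (made up): index 3 is a false singleton, 6 a true one.
fps = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {1, 2, 5, 8},
    {1, 2, 3, 4, 9}, {1, 2, 3, 4, 10}, {20, 21},
]
clusters, neighbours = butina_like(fps, 0.55)
print(clusters)
print(rescue_false_singletons(fps, clusters, neighbours, 0.3))
```

Here molecule 3 ends up alone only because its single neighbour (molecule 1)
was claimed by the cluster around molecule 0, which is what makes it a false
singleton; molecule 6 has no neighbours at all and stays a true singleton.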

Best,
Dave


On Wed, Jul 21, 2021 at 8:58 AM Francesca Magarotto -
francesca.magarot...@studio.unibo.it 
wrote:

> Hi,
> I managed to perform Taylor-Butina clustering on a dataset of 193 571
> fragments retrieved from ZINC20.
> I followed the instructions at this link
> https://www.macinchem.org/reviews/clustering/clustering.php
> I have never used RDKit or done a cluster analysis before, so I'm really
> new to this type of work. I've read the paper on Taylor-Butina clustering
> (https://pubs.acs.org/doi/10.1021/ci9803381), but I don't understand
> whether it can be considered a hierarchical method or not.
> Could someone help me understand this?
> Moreover, I have some problems generating images after clustering.
> First, I don't know which images I need: if the method is hierarchical I
> should draw a dendrogram, but if it isn't hierarchical there's no need
> (I think).
> I only managed to obtain an image of a sparse similarity matrix; there is
> not enough RAM to build a dense matrix.
> I wasn't able to plot the clusters or to obtain images of the molecules
> that are centroids or false singletons (I tried using RDKit to obtain
> images from the fingerprints, but the resulting molecule images look
> strange). I get thousands of clusters and false singletons as results.
> Has anyone done something like this in the past? Any suggestions?
> I have worked out my own explanation of what false and true singletons
> are (I obtain only false singletons, is that normal?), but I would
> appreciate it if someone more expert could explain this and confirm my
> guess.
> I'm sorry for all these questions, but I'm really new to this topic.
> I hope someone can help me.
> Kind regards.


-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk


Re: [Rdkit-discuss] Javascript MinimalLib

2021-07-21 Thread David Cosgrove
Brilliant, thanks. I will take note of how to do it myself in future.

Best,
Dave


On Wed, 21 Jul 2021 at 12:32, Greg Landrum wrote:

> Hi Dave,
>
> It's not in the JS interface yet, but I'll add it now.
>
> -greg
>
>
> On Mon, Jul 19, 2021 at 4:57 PM David Cosgrove 
> wrote:
>
>> Hi,
>>
>> In this blogpost
>> https://greglandrum.github.io/rdkit-blog/technical/2021/05/01/rdkit-cffi-part1.html,
>> Greg mentions the CFFI function get_json().  Is that exposed in the JS
>> MinimalLib, and if so, how would I use it?  I see all sorts of good stuff
>> in cffiwrapper.h, but I can't work out how to call them from JS.
>>
>> Thanks,
>> Dave
>>
>>
>> --
>> David Cosgrove
>> Freelance computational chemistry and chemoinformatics developer
>> http://cozchemix.co.uk
>>
--
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk


Re: [Rdkit-discuss] Javascript MinimalLib

2021-07-21 Thread Greg Landrum
Hi Dave,

It's not in the JS interface yet, but I'll add it now.

-greg


On Mon, Jul 19, 2021 at 4:57 PM David Cosgrove 
wrote:

> Hi,
>
> In this blogpost
> https://greglandrum.github.io/rdkit-blog/technical/2021/05/01/rdkit-cffi-part1.html,
> Greg mentions the CFFI function get_json().  Is that exposed in the JS
> MinimalLib, and if so, how would I use it?  I see all sorts of good stuff
> in cffiwrapper.h, but I can't work out how to call them from JS.
>
> Thanks,
> Dave
>
>
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
>


Re: [Rdkit-discuss] Substructure search for an aldehyde returns ketones and acids

2021-07-21 Thread Greg Landrum
Yeah, this is exactly the case where using qmol_from_ctab() should help.

Below is a short example demonstrating this by querying my local ChEMBL
instance. Notice that the first form of the query, which uses
mol_from_ctab(), matches what you describe: the results include amides,
esters, etc. The second query, which uses qmol_from_ctab(), only returns
molecules which have an aldehyde.

I hope this helps,
-greg

chembl_28=# select * from rdk.mols where m@>mol_from_ctab('aldehyde query
  MJ192500

  4  3  0  0  0  0  0  0  0  0999 V2000
   -2.8123    1.5508    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5267    1.1383    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.2412    1.5508    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5267    0.3133    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  2  4  2  0  0  0  0
  2  3  1  0  0  0  0
M  END
') limit 5;
 molregno | m
----------+----------
   310993 | O=C(NO)c1cc(CS(=O)(=O)c2ccc(Cl)cc2)on1
   310992 | O=C(NO)c1cc(CS(=O)(=O)c2(Cl)c2)on1
   318822 | CCC(NC(=O)C[C@H](N)C(=O)N1CCC[C@H]1C#N)c1c1
   310016 | O=C(CCNC(=O)c1c1)NC1CCN(Cc2ccc(Cl)cc2)C1
   319381 | CCOC(=O)/C=C/c1ccc(CN(C(=O)C2C2)c2(/C=C/C(=O)OC)c2)cc1
(5 rows)

chembl_28=# select * from rdk.mols where m@>qmol_from_ctab('aldehyde query
  MJ192500

  4  3  0  0  0  0  0  0  0  0999 V2000
   -2.8123    1.5508    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5267    1.1383    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.2412    1.5508    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5267    0.3133    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  2  4  2  0  0  0  0
  2  3  1  0  0  0  0
M  END
') limit 5;
 molregno | m
----------+----------
   284772 | COC(=O)NC1[C@H](C)O[C@@H](O[C@H]2C/C=C(\C)[C@@H]3C=C[C@@H]4[C@@H](O)[C@@H](C)C[C@H](C)[C@H]4[C@]3(C)/C(O)=C3\C(=O)O[C@]4(CC(C=O)=C[C@H](OC(C)=O)[C@H]4/C=C\2C)C3=O)CC1(C)[N+](=O)[O-]
   284633 | COC(=O)NC1[C@H](C)O[C@@H](O[C@H]2C/C=C(\C)[C@@H]3C=C[C@@H]4[C@@H](O[C@H]5O5)[C@@H](C)C[C@H](C)[C@H]4[C@]3(C)/C(O)=C3\C(=O)O[C@]4(CC(C=O)=C[C@H](OC(C)=O)[C@H]4/C=C\2C)C3=O)CC1(C)[N+](=O)[O-]
   284865 | COC(=O)NC1[C@H](C)O[C@@H](O[C@H]2C/C=C(\C)[C@@H]3C=C[C@@H]4[C@@H](OCc5ccc(OC)cc5)[C@@H](C)C[C@H](C)[C@H]4[C@]3(C)/C(O)=C3\C(=O)O[C@]4(CC(C=O)=C[C@H](OC(C)=O)[C@H]4/C=C\2C)C3=O)CC1(C)[N+](=O)[O-]
   299586 | CC1(C)C2CC[C@]3(C)C(CC=C4C5CC(C)(C)[C@@H](OC(=O)c6c6)[C@H](OC(=O)/C=C/c6c6)[C@]5(C=O)[C@H](O)C[C@]43C)[C@@]2(C)CC[C@@H]1O
   317613 | Cn1cncc1C=O
(5 rows)
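
For readers working from Python rather than the cartridge, the same
difference can be illustrated roughly as follows. This is only a sketch of
an analogue, not a description of what the cartridge does internally: it
assumes that reading the CTAB with removeHs=False and then calling
Chem.MergeQueryHs() is a fair stand-in for qmol_from_ctab(). The coordinate
spacing in the mol block is reconstructed, and the three target molecules
are made up for the example.

```python
from rdkit import Chem

# Aldehyde query from the thread; the column spacing is reconstructed.
ALDEHYDE_CTAB = """aldehyde query
  MJ192500

  4  3  0  0  0  0  0  0  0  0999 V2000
   -2.8123    1.5508    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5267    1.1383    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.2412    1.5508    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5267    0.3133    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  2  4  2  0  0  0  0
  2  3  1  0  0  0  0
M  END
"""

# Default parsing removes the explicit H, leaving a plain C-C=O pattern.
plain = Chem.MolFromMolBlock(ALDEHYDE_CTAB)

# Keep the H while parsing, then merge it into the carbonyl carbon as a
# hydrogen-count query.
query = Chem.MergeQueryHs(Chem.MolFromMolBlock(ALDEHYDE_CTAB, removeHs=False))

# Made-up targets for illustration.
targets = {
    'acetaldehyde': 'CC=O',
    'acetone': 'CC(C)=O',
    'acetic acid': 'CC(O)=O',
}
for name, smi in targets.items():
    mol = Chem.MolFromSmiles(smi)
    print(name, mol.HasSubstructMatch(plain), mol.HasSubstructMatch(query))
```

The plain pattern has lost the information that the carbonyl carbon carries
a hydrogen, so it can match any C-C(=O) fragment; the merged query keeps
that requirement.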



On Tue, Jul 20, 2021 at 11:55 PM Webster Homer <
webster.ho...@milliporesigma.com> wrote:

> I should have included the query. It looks like RDKit is ignoring the H
> atom.
>
> The user put in an explicit H.
>
> ===MOL file after this
> aldehyde query
>   MJ192500
>
>   4  3  0  0  0  0  0  0  0  0999 V2000
>    -2.8123    1.5508    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>    -3.5267    1.1383    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>    -4.2412    1.5508    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
>    -3.5267    0.3133    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
>   2  1  1  0  0  0  0
>   2  4  2  0  0  0  0
>   2  3  1  0  0  0  0
> M  END
> ===MOL file above this
>
>
>
>
>
> *From:* Greg Landrum 
> *Sent:* Friday, July 16, 2021 11:38 PM
> *To:* Webster Homer 
> *Cc:* rdkit-discuss@lists.sourceforge.net
> *Subject:* Re: [Rdkit-discuss] Substructure search for an aldehyde
> returns ketones and acids
>
>
>
>
>
> Hi Webster,
>
>
>
> Without seeing an actual query I am inclined to believe that it’s not a
> bug. The problem is more likely a query which has not been drawn explicitly
> or an easily made mistake in the way the cartridge is being used.
>
>
>
> Assuming that the aldehyde queries have been drawn with an explicit H atom
> connected to the C (apologies for not showing this, I’m on my phone and
> don’t have a sketcher available), you should be calling the cartridge
> function qmol_from_ctab(), not mol_from_ctab(), before doing the query.
> qmol_from_ctab() will use the H to help define the query.
>
>
>
> If you’re doing this and still seeing incorrect search results, please
> share a query and the way you’re doing the search and we can try to help
> (or diagnose the bug if there is one)
>
>
>
> Best,
>
> -greg
>
>
>
>
>
> On Fri, 16 Jul 2021 at 17:53, Webster Homer <
> webster.ho...@milliporesigma.com> wrote:
>
> We use RDKit Postgresql 

[Rdkit-discuss] Taylor-Butina clustering

2021-07-21 Thread Francesca Magarotto - francesca.magarot...@studio.unibo.it
Hi,
I managed to perform Taylor-Butina clustering on a dataset of 193 571
fragments retrieved from ZINC20.
I followed the instructions at this link
https://www.macinchem.org/reviews/clustering/clustering.php
I have never used RDKit or done a cluster analysis before, so I'm really
new to this type of work. I've read the paper on Taylor-Butina clustering
(https://pubs.acs.org/doi/10.1021/ci9803381), but I don't understand
whether it can be considered a hierarchical method or not.
Could someone help me understand this?
Moreover, I have some problems generating images after clustering.
First, I don't know which images I need: if the method is hierarchical I
should draw a dendrogram, but if it isn't hierarchical there's no need
(I think).
I only managed to obtain an image of a sparse similarity matrix; there is
not enough RAM to build a dense matrix.
I wasn't able to plot the clusters or to obtain images of the molecules
that are centroids or false singletons (I tried using RDKit to obtain
images from the fingerprints, but the resulting molecule images look
strange). I get thousands of clusters and false singletons as results.
Has anyone done something like this in the past? Any suggestions?
I have worked out my own explanation of what false and true singletons
are (I obtain only false singletons, is that normal?), but I would
appreciate it if someone more expert could explain this and confirm my
guess.
I'm sorry for all these questions, but I'm really new to this topic.
I hope someone can help me.
Kind regards.
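
On the image question, one possible approach is to draw the cluster
centroids from the original molecules rather than from the fingerprints.
The sketch below assumes the clustering result is a sequence of index
tuples into the list of molecules, with the centroid first in each tuple
(the layout returned by rdkit.ML.Cluster.Butina.ClusterData). The helper
names pick_centroids and draw_centroids, the example SMILES, and the
clusters are all made up for illustration.

```python
from rdkit import Chem
from rdkit.Chem import Draw


def pick_centroids(mols, clusters, max_clusters=12):
    """Return (molecules, legends) for the centroids of the largest clusters."""
    largest = sorted(clusters, key=len, reverse=True)[:max_clusters]
    centroids = [mols[cluster[0]] for cluster in largest]
    legends = [f"idx {c[0]}, size {len(c)}" for c in largest]
    return centroids, legends


def draw_centroids(mols, clusters, max_clusters=12):
    """Draw the selected centroids as a grid image."""
    centroids, legends = pick_centroids(mols, clusters, max_clusters)
    return Draw.MolsToGridImage(
        centroids, molsPerRow=4, subImgSize=(200, 200), legends=legends
    )


# Made-up input: five molecules and two clusters (centroid index first).
smiles = ['CCO', 'CCN', 'CCC', 'c1ccccc1', 'Cc1ccccc1']
mols = [Chem.MolFromSmiles(s) for s in smiles]
clusters = [(3, 4), (0, 1, 2)]
centroids, legends = pick_centroids(mols, clusters)
print(legends)
```

Calling draw_centroids(mols, clusters) then produces the grid; the type of
object returned depends on whether the code runs in a notebook or in a
plain script. With thousands of clusters it is probably more useful to look
at the largest ones first, which is why the helper sorts by cluster size.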