Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-26 Thread Francois Berenger

On 21/09/2018 16:53, Chris Earnshaw wrote:

Hi

I'm afraid I can't help with an RDKit solution to your question, but
there are a couple of issues which should be borne in mind:
1) The centroid of a cluster is a vector mean of the fingerprints of
all the members of the cluster and probably will not be represented
_exactly_ by any member of the cluster; in this case no structures
will have a distance of 0.0 from the centroid. Do you want to
calculate the distances from the true centroid or from the
structure(s) closest to the centroid?


I have seen 'clustroid' used in the literature to mean the
cluster member nearest to the centroid of that cluster.
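
As a minimal sketch of the centroid/clustroid distinction discussed above: the centroid is taken as the per-bit mean of the members' fingerprints, and the clustroid as the member nearest to it. The fingerprints here are toy 0/1 lists, the use of Manhattan distance is an assumption, and the function names are hypothetical.

```python
def mean_fingerprint(fps):
    """Per-bit average of equal-length 0/1 fingerprints (the 'true' centroid)."""
    n = len(fps)
    return [sum(bits) / n for bits in zip(*fps)]


def manhattan(a, b):
    """Manhattan (city-block) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def clustroid(fps):
    """Return (index of the member nearest the centroid, all member distances)."""
    centroid = mean_fingerprint(fps)
    dists = [manhattan(fp, centroid) for fp in fps]
    best = min(range(len(fps)), key=dists.__getitem__)
    return best, dists


# Toy cluster of three 4-bit fingerprints (made-up data).
cluster = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
]
idx, dists = clustroid(cluster)
print(idx, dists)
```

In this toy cluster no member has a distance of 0.0 from the averaged fingerprint, which is the situation Chris describes in point 1.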


2) The Tanimoto metric doesn't obey the triangle inequality and is
therefore sub-optimal for this kind of analysis. It's better to use an
alternative which does obey the triangle inequality - e.g. the Cosine
metric.


The opposite is true.

Sven Kosub. A note on the triangle inequality for the Jaccard distance.
CoRR, abs/1612.02696, 2016.

Alan H. Lipkus. A proof of the triangle inequality for the Tanimoto
distance. Journal of Mathematical Chemistry, 26(1):263–265, Oct 1999.

Cosine distance (1 - cosine similarity), on the other hand, is not a
proper metric, according to Wikipedia.

I'm not a mathematician, but I think (1 - Tanimoto) is a proper distance
as long as the molecules are encoded with only non-negative values.
So Boolean fingerprints are OK, and so are unfolded count
fingerprints.
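
A small worked example may help here. It is only a sketch: fingerprints are represented as sets of on-bits, "cosine distance" is assumed to mean 1 - cosine similarity, and the three toy fingerprints are made up. A single triple cannot prove that the inequality always holds (the cited papers do that for Tanimoto), but it is enough to show a violation for cosine distance.

```python
import math


def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for two sets of on-bits."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0


def cosine_distance(a, b):
    """1 - cosine similarity for two non-empty sets of on-bits."""
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))


# Three toy fingerprints: B shares one bit with A and one with C.
A, B, C = {1}, {1, 2}, {2}

tani_direct = tanimoto_distance(A, C)
tani_via_b = tanimoto_distance(A, B) + tanimoto_distance(B, C)

cos_direct = cosine_distance(A, C)
cos_via_b = cosine_distance(A, B) + cosine_distance(B, C)

# Triangle inequality requires direct <= detour through B.
print("Tanimoto:", tani_direct, "<=", tani_via_b, tani_direct <= tani_via_b)
print("Cosine:  ", cos_direct, "<=", cos_via_b, cos_direct <= cos_via_b)
```

For this triple the Tanimoto distances satisfy the inequality (with equality), while the cosine distances do not: going from A to C through B is shorter than going directly.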

Regards,
Francois.


Regards,
Chris Earnshaw

On Thu, 20 Sep 2018 at 21:55, James T. Metz via Rdkit-discuss
 wrote:


RDKit Discussion Group,

I note that RDKit can perform Butina clustering. Given an SDF of
small molecules, I would like to cluster the ligands but obtain
additional information from the clustering algorithm. In particular,
I would like to obtain the cluster number and the Tanimoto distance
from the centroid for every ligand in the SDF. The centroid would
obviously have a distance of 0.00.

Has anyone written additional RDKit code to extract this
additional information?

Thank you.

Regards,

Jim Metz
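
One way the requested output could be produced is sketched below in plain Python, so that it runs without any toolkit. It is not RDKit code: fingerprints are toy sets of on-bits, the clustering is a simplified Taylor-Butina pass, and the names `butina` and `cluster_table` are hypothetical. In RDKit itself, `rdkit.ML.Cluster.Butina.ClusterData` returns each cluster with its centroid as the first element, so the same reporting step should apply to its output.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for two sets of on-bits."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0


def butina(fps, cutoff):
    """Simplified Taylor-Butina clustering; each cluster lists its centroid first."""
    n = len(fps)
    neighbours = [
        [j for j in range(n)
         if j != i and tanimoto_distance(fps[i], fps[j]) <= cutoff]
        for i in range(n)
    ]
    # Visit candidates from most to fewest neighbours (ties keep input order).
    order = sorted(range(n), key=lambda i: len(neighbours[i]), reverse=True)
    assigned = [False] * n
    clusters = []
    for i in order:
        if assigned[i]:
            continue
        members = [i] + [j for j in neighbours[i] if not assigned[j]]
        for m in members:
            assigned[m] = True
        clusters.append(members)
    return clusters


def cluster_table(fps, clusters):
    """For every item: (cluster number, Tanimoto distance from its centroid)."""
    rows = {}
    for number, members in enumerate(clusters):
        centroid = members[0]
        for m in members:
            rows[m] = (number, tanimoto_distance(fps[m], fps[centroid]))
    return [rows[i] for i in range(len(fps))]


# Made-up fingerprints standing in for the ligands of an SDF.
fps = [{1, 2, 3, 4}, {1, 2, 3}, {1, 2, 3, 5}, {7, 8, 9}, {7, 8}, {20, 21}]
clusters = butina(fps, cutoff=0.5)
table = cluster_table(fps, clusters)
for i, (number, dist) in enumerate(table):
    print(i, number, round(dist, 2))
```

Each centroid is reported with a distance of 0.0 because the Butina "centroid" is an actual cluster member, not an averaged fingerprint.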

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss [1]



Links:
--
[1] https://lists.sourceforge.net/lists/listinfo/rdkit-discuss



Re: [Rdkit-discuss] Saving mol file

2018-09-26 Thread GALLY Jose Manuel
Dear Colin,
this is a specific problem I stumbled upon some time ago.[1]

I also mentioned it to the rDock mailing list.[2]

Maybe there is a better work-around, but in the meantime I wrote the
attached function.

It takes a Mol Block as input; in my case these are stored in a dataframe.

Hope that helps!

Cheers,
Jose Manuel

Refs:
[1] https://sourceforge.net/p/rdkit/mailman/message/34740124/
[2] https://sourceforge.net/p/rdock/mailman/message/34741112/

2018-09-25 17:27 GMT+02:00 Colin Bournez :

> Well yes, I do have this line; I left out the rest of the file for
> clarity. The thing is, tools such as MOE and PyMOL read it without problem,
> but rDock, for example, can't read it properly and returns a neutral N,
> which is not correct. And if I open it with PyMOL and save it back in mol
> format, the 3 appears on the N line and rDock has no trouble anymore...
> I was just wondering if there was a trick in RDKit to also save it this
> way.
>
>
> On 25/09/18 17:18, Greg Landrum wrote:
>
> Hi Colin,
> The RDKit outputs charge information to mol blocks using the CHG line:
>
> In [3]: m = Chem.MolFromSmiles('C[NH3+]')
>
> In [4]: print(Chem.MolToMolBlock(m))
>
>      RDKit          2D
>
>   2  1  0  0  0  0  0  0  0  0999 V2000
>     0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>     1.2990    0.7500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
>   1  2  1  0
> M  CHG  1   2   1
> M  END
>
>
> I expect that you will find one of those in your mol file and that it
> should be properly read in by other tools.
> Is this not the case for you?
>
> Best,
> -greg
>
>
>
> On Tue, Sep 25, 2018 at 4:39 PM Colin Bournez <
> colin.bour...@univ-orleans.fr> wrote:
>
>> Hey everyone,
>>
>> I have a question concerning the Chem.MolToMolFile() function.
>> When I open a file containing an N+ (here is the corresponding line in
>> the mol file):
>>
>>    11.3700    3.4360  -11.8300 N   0  3  0  0  0  0  0  0  0  0  0  0
>>
>> and just save it back without any modification, the line becomes:
>>
>>    11.3700    3.4360  -11.8300 N   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> The problem is that some software has trouble with this mol file and
>> turns the N+ into an N with 4 bonds.
>> I tried several tricks but was not able to save it with the original
>> line. Does anyone have a suggestion?
>>
>> Thanks,
>>
>> --
>> *Colin Bournez*
>> PhD Student, Structural Bioinformatics & Chemoinformatics
>> Institut de Chimie Organique et Analytique (ICOA), UMR CNRS-Université
>> d'Orléans 7311
>> Rue de Chartres, 45067 Orléans, France
>> T. +33 238 494 577
>>
>
> --
> *Colin Bournez*
> PhD Student, Structural Bioinformatics & Chemoinformatics
> Institut de Chimie Organique et Analytique (ICOA), UMR CNRS-Université
> d'Orléans 7311
> Rue de Chartres, 45067 Orléans, France
> T. +33 238 494 577
>
>
>
>


-- 
José-Manuel Gally
PhD Student
Structural Bioinformatics & Chemoinformatics
Institut de Chimie Organique et Analytique (ICOA)
UMR CNRS-Université d'Orléans 7311
Université d'Orléans
Rue de Chartres
F-45067 Orléans
phone: +33 238 494 577

def UpdateChargeFlagInAtomBlock(mb):
    """
    This function opens a file twice.
    During the first time it reads it in order to extract all Mol Blocks
    and update them with expected charge flags in atomblocks in memory.
    During second time it rewrites it using the updated Mol Blocks in memory.
    """
    f = "{:>10s}"*3 + "{:>2}{:>4s}" + "{:>3s}"*11
    chgs = []  # list of charges
    lines = mb.split("\n")
    if mb.startswith("\n"):  # drop the leading empty line, if any
        del lines[0]
    CTAB = lines[2]
    atomCount = int(CTAB.split()[0])
    # parse mb line per line
    for l in lines:
        # look for M CHG property
        if l[0:6] == "M  CHG":
            records = l.split()[3:]  # M  CHG X is not needed for parsing, the info we want comes afterwards
            # record each charge into a list
            for i in range(0, len(records), 2):
                idx = records[i]
                chg = records[i+1]
                chgs.append((int(idx), int(chg)))  # sort tuples by first element?
            break  # stop iterating

    # sort by idx in order to parse the molblock only once more
    chgs = sorted(chgs, key=lambda x: x[0])

    # now that we have a list for the current molblock, assign each charge
    for chg in chgs:
        i = 3
        while i < 3+atomCount:  # do not read from beginning each time, rather continue parsing mb!
            # when finding the idx of the atom we want to update, extract all fields and rewrite whole sequence
            if i-2 == chg[0]:# -4 to take into account the CTAB headers, +1 
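
The attached function is cut off at this point in the archive. As a minimal sketch of the same idea, and not the original attachment, the following reads the `M  CHG` lines of a V2000 mol block and writes the matching charge code into the atom block. It assumes a standard three-line header (so the counts line is the fourth line), the fixed-width V2000 atom line with the charge field in columns 37-39, and the usual V2000 charge codes (1 = +3, 2 = +2, 3 = +1, 5 = -1, 6 = -2, 7 = -3). The name `add_charge_flags` is hypothetical.

```python
# Sketch only: not the original attachment. Assumes a V2000 mol block with
# the standard 3-line header and fixed-width atom lines.

# V2000 atom-block charge codes (0 means uncharged or not specified).
CHARGE_TO_CODE = {3: 1, 2: 2, 1: 3, -1: 5, -2: 6, -3: 7}


def add_charge_flags(mol_block):
    """Copy charges from 'M  CHG' lines into the atom-block charge field."""
    lines = mol_block.split("\n")
    n_atoms = int(lines[3][0:3])  # counts line: atom count in columns 1-3

    # Collect {atom index (1-based): formal charge} from every M  CHG line.
    charges = {}
    for line in lines:
        if line.startswith("M  CHG"):
            fields = line.split()[3:]  # skip 'M', 'CHG' and the entry count
            for k in range(0, len(fields) - 1, 2):
                charges[int(fields[k])] = int(fields[k + 1])

    # Rewrite columns 37-39 of each charged atom line.
    for atom_idx, charge in charges.items():
        if not 1 <= atom_idx <= n_atoms:
            continue  # ignore indices outside the atom block
        row = 3 + atom_idx  # atom 1 sits right after the counts line
        code = CHARGE_TO_CODE.get(charge, 0)
        lines[row] = lines[row][:36] + "{:>3d}".format(code) + lines[row][39:]
    return "\n".join(lines)


# The C[NH3+] mol block from Greg's message, rebuilt line by line.
example_block = "\n".join([
    "",
    "     RDKit          2D",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.2990    0.7500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  CHG  1   2   1",
    "M  END",
])
patched = add_charge_flags(example_block)
print(patched.split("\n")[5])
```

On this example the N line gains a 3 in the charge column, which is the form Colin reports rDock reading correctly; the `M  CHG` line is left in place.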

Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-26 Thread Andrew Dalke
On Sep 26, 2018, at 20:26, Peter S. Shenkin  wrote:
> Ah, David, but how do you define a "real" singleton?

There can be many different definitions of what a '"real" singleton' might be, 
but we are specifically talking about Butina clustering.

The Butina paper defines the term "false singleton", which Dave quoted. The 
relevant text from DOI: 10.1021/ci9803381 is:

"""The molecules that have not been flagged by the end of the clustering 
process, either as a cluster centroid or as a cluster member, become 
singletons. It is important to emphasize at this stage that one of the 
consequences of this approach is that some molecules defined as singletons may 
have neighbors at the given Tanimoto similarity index, but those neighbors have 
been excluded by a ‘stronger’ cluster centroid, i.e., one with more neighbors 
in its list. ... the problem with the creation of a number of false singletons 
that do in fact have similar compounds within the set is easily offset by the 
final quality of the clusters that this approach generates."""

As you can see, there are two types of singletons, and one is called "false 
singleton". No specific name is used for the other type of singleton, but it's 
easy to see how they can be called "real" singletons, without confusion or 
misunderstanding.

(FWIW, my implementation, mentioned in an earlier email, uses the term "true 
singleton" as the singleton which is not a "false singleton", but the 
difference is only in the label.)

To confirm that this is what Dave means, I'll quote from his paper 

Blomberg, N., Cosgrove, D. A., Kenny, P. W., & Kolmodin, K. (2009). Design of 
compound libraries for fragment screening. Journal of Computer-Aided Molecular 
Design, 23(8), 513–525. doi:10.1007/s10822-009-9264-5

"""The clustering program flush_clus is an implementation of the 
sphere-exclusion algorithm of Taylor [41], which has also been reported 
independently by Butina ... One consequence of the algorithm is the production 
of ‘false singleton clusters.’ The final clusters in the output are invariably 
singleton clusters, where the only member is the seed. Some of these will be 
true singletons, i.e. molecules lacking neighbors within the clustering 
threshold, but others (the false singletons) will be singletons by virtue of 
the fact that their neighbors were placed in other larger clusters in a 
previous iteration of the algorithm. The flush_clus program offers the 
opportunity of performing a final sweep through the clusters using a larger 
similarity threshold and placing the singleton molecules within the cluster for 
which it has the greatest similarity with the seed, so long as this is within 
the threshold."""

Cheers,


Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-26 Thread Peter S. Shenkin
Ah, David, but how do you define a "real" singleton?

-P.

On Wed, Sep 26, 2018 at 1:30 PM David Cosgrove 
wrote:

> Slightly off topic, but a minor issue with the Taylor-Butina algorithm is
> that it generates “false singletons”. These are molecules just outside the
> clustering cutoff that are stranded when their neighbours are put in a
> different, larger cluster. We used to find it convenient to have a sweep of
> these, at a slightly looser cutoff, and drop them into the cluster whose
> centroid/seed they were nearest to. This could be added to Andrew’s code
> quite easily. At the very least, it’s worth keeping track of the initial
> number of neighbours within the cluster cutoff that each fingerprint had so
> as to distinguish real singletons from these artefactual ones.
> Dave
>
>
> On Tue, 25 Sep 2018 at 19:56, Peter S. Shenkin  wrote:
>
>> Well, I'm not really familiar with the Taylor-Butina clustering method,
>> so I'm proposing a methodology based on generalizing something that I found
>> to be useful in a somewhat different clustering context.
>>
>> Presuming that what you are clustering is the fingerprints of structures,
>> and that you know which structures are in each cluster, you'd compute the
>> average of all the fingerprints. That is, each bit position would be given
>> a floating point number that is the average of the 0s and 1s at that
>> position computed over the structures in the cluster.  Then you'd compute
>> the distance (say, Manhattan or Euclidean) between the fingerprint of each
>> structure in the cluster and the average so computed. The "most
>> representative structure" would be the cluster member whose fingerprint is
>> closest to the cluster's average fingerprint. (Some additional mileage
>> could be gained by seeing just how far away from the average the "most
>> representative structures" are. They might be more representative (i.e.,
>> closer) for some clusters than for others.)
>>
>> It would make sense to try this (since it's easy enough) and see whether
>> the resulting "most representative structures" from the clusters really are
>> at least roughly representative, by comparing them with viewable random
>> subsets of structures from the clusters.
>>
>> -P.
>>
>> On Tue, Sep 25, 2018 at 2:36 PM, Andrew Dalke 
>> wrote:
>>
>>> On Sep 25, 2018, at 17:13, Peter S. Shenkin  wrote:
>>> > FWIW, in work on conformational clustering, I used the “most
>>> representative” molecule; that is, the real molecule closest to the
>>> mathematical centroid. This would probably be the best way of displaying a
>>> single molecule that typifies what is in the cluster.
>>>
>>> In some sense I'm rephrasing Chris Earnshaw's earlier question - how
>>> does one do that with Taylor-Butina clustering? And does it make sense?
>>>
>>> The algorithm starts by picking a centroid based on the fingerprints
>>> with the highest number of neighbors, so none of the other cluster members
>>> should have more neighbors within that cutoff.
>>>
>>> I am far from an expert on this topic, but any alternative I can
>>> think of makes me think I should have started with something other than
>>> Taylor-Butina.
>>>
>>>
>>>
>>> Andrew
>>> da...@dalkescientific.com
>>>
>>>
>>>
>>>
>>>
>>
>>
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
>
>


Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-26 Thread David Cosgrove
Slightly off topic, but a minor issue with the Taylor-Butina algorithm is
that it generates “false singletons”. These are molecules just outside the
clustering cutoff that are stranded when their neighbours are put in a
different, larger cluster. We used to find it convenient to have a sweep of
these, at a slightly looser cutoff, and drop them into the cluster whose
centroid/seed they were nearest to. This could be added to Andrew’s code
quite easily. At the very least, it’s worth keeping track of the initial
number of neighbours within the cluster cutoff that each fingerprint had so
as to distinguish real singletons from these artefactual ones.
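
The bookkeeping and sweep described above might look roughly like the sketch below. It is not Andrew's code or the flush_clus program: fingerprints are toy sets of on-bits, the clusters are written out by hand as a Taylor-Butina run at a distance cutoff of 0.3 would produce them for this data (centroid first), and the function names and the looser cutoff of 0.35 are assumptions.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for two sets of on-bits."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0


def classify_singletons(fps, clusters, cutoff):
    """Split singleton clusters into (true, false) by initial neighbour count."""
    true_s, false_s = [], []
    for members in clusters:
        if len(members) != 1:
            continue
        i = members[0]
        n_neighbours = sum(
            1 for j in range(len(fps))
            if j != i and tanimoto_distance(fps[i], fps[j]) <= cutoff
        )
        # A false singleton had neighbours, but they went to another cluster.
        (false_s if n_neighbours else true_s).append(i)
    return true_s, false_s


def sweep_false_singletons(fps, clusters, false_singletons, loose_cutoff):
    """Drop each false singleton into the cluster with the nearest centroid,
    provided that centroid lies within the looser cutoff."""
    result = [list(c) for c in clusters]
    multi = [c for c in result if len(c) > 1]  # candidate target clusters
    for i in false_singletons:
        if not multi:
            break
        best = min(multi, key=lambda c: tanimoto_distance(fps[i], fps[c[0]]))
        if tanimoto_distance(fps[i], fps[best[0]]) <= loose_cutoff:
            result.remove([i])
            best.append(i)
    return result


base = set(range(1, 11))
fps = [
    base,                # 0: centroid of the big cluster
    base | {13, 14},     # 1
    set(range(2, 12)),   # 2
    set(range(3, 13)),   # 3: neighbour of 2, but just outside the cutoff from 0
    base | {15, 16},     # 4
    {20, 21, 22},        # 5: no neighbours at all
]
clusters = [[0, 1, 2, 4], [3], [5]]

true_s, false_s = classify_singletons(fps, clusters, cutoff=0.3)
swept = sweep_false_singletons(fps, clusters, false_s, loose_cutoff=0.35)
print(true_s, false_s, swept)
```

Here item 3 is a false singleton because its only neighbour (item 2) was claimed by the larger cluster, while item 5 is a true singleton and is left alone by the sweep.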
Dave


On Tue, 25 Sep 2018 at 19:56, Peter S. Shenkin  wrote:

> Well, I'm not really familiar with the Taylor-Butina clustering method, so
> I'm proposing a methodology based on generalizing something that I found to
> be useful in a somewhat different clustering context.
>
> Presuming that what you are clustering is the fingerprints of structures,
> and that you know which structures are in each cluster, you'd compute the
> average of all the fingerprints. That is, each bit position would be given
> a floating point number that is the average of the 0s and 1s at that
> position computed over the structures in the cluster.  Then you'd compute
> the distance (say, Manhattan or Euclidean) between the fingerprint of each
> structure in the cluster and the average so computed. The "most
> representative structure" would be the cluster member whose fingerprint is
> closest to the cluster's average fingerprint. (Some additional mileage
> could be gained by seeing just how far away from the average the "most
> representative structures" are. They might be more representative (i.e.,
> closer) for some clusters than for others.)
>
> It would make sense to try this (since it's easy enough) and see whether
> the resulting "most representative structures" from the clusters really are
> at least roughly representative, by comparing them with viewable random
> subsets of structures from the clusters.
>
> -P.
>
> On Tue, Sep 25, 2018 at 2:36 PM, Andrew Dalke 
> wrote:
>
>> On Sep 25, 2018, at 17:13, Peter S. Shenkin  wrote:
>> > FWIW, in work on conformational clustering, I used the “most
>> representative” molecule; that is, the real molecule closest to the
>> mathematical centroid. This would probably be the best way of displaying a
>> single molecule that typifies what is in the cluster.
>>
>> In some sense I'm rephrasing Chris Earnshaw's earlier question - how does
>> one do that with Taylor-Butina clustering? And does it make sense?
>>
>> The algorithm starts by picking a centroid based on the fingerprints with
>> the highest number of neighbors, so none of the other cluster members
>> should have more neighbors within that cutoff.
>>
>> I am far from an expert on this topic, but any alternative I can
>> think of makes me think I should have started with something other than
>> Taylor-Butina.
>>
>>
>>
>> Andrew
>> da...@dalkescientific.com
>>
>>
>>
>>
>>
>
>
-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk