[Rdkit-discuss] boron compound not recognized by RDkit

2018-09-25 Thread Bennion, Brian via Rdkit-discuss
Hello,
A while back I noticed that RDKit has issues with boron-containing 
compounds.  One is below, and I admit it is a strange one. I read in an SDF 
file and write it out after calculating a formal charge on the molecule.
It seems to be read into RDKit OK, but writing errored out with "ValueError: 
could not find number of expected rings."
I think it odd that the compound can be read in but not written out.  Should I 
just ignore this molecule?
Brian




OpenBabel08161816583D

 12 30  0  0  0  0  0  0  0  0999 V2000
    0.7000   -4.9240   -0.0370 B   0  0  0  0  0  0  0  0  0  0  0  0
    1.5320   -2.2270   -0.0390 B   0  0  0  0  0  0  0  0  0  0  0  0
    0.0470   -3.3430   -0.0100 B   0  0  0  0  0  0  0  0  0  0  0  0
    3.9570   -0.6710   -0.0740 B   0  0  0  0  0  0  0  0  0  0  0  0
   -1.3100    0.9460    0.0290 B   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7380   -1.0010    0.0270 B   0  0  0  0  0  0  0  0  0  0  0  0
    3.8030   -1.9300   -0.2150 B   0  0  0  0  0  0  0  0  0  0  0  0
    2.0560    0.0860   -0.0260 B   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7320   -1.1240   -0.0280 B   0  0  0  0  0  0  0  0  0  0  0  0
   -1.8540   -2.5110   -0.1560 B   0  0  0  0  0  0  0  0  0  0  0  0
    0.8030    1.3740    0.1040 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2660   -0.0510   -0.0220 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  1  0  0  0  0
  1  5  1  0  0  0  0
  1  6  1  0  0  0  0
  2  3  1  0  0  0  0
  2  4  1  0  0  0  0
  2  7  1  0  0  0  0
  2  8  1  0  0  0  0
  3  6  1  0  0  0  0
  3  7  1  0  0  0  0
  3 10  1  0  0  0  0
  4  5  1  0  0  0  0
  4  8  1  0  0  0  0
  4  9  1  0  0  0  0
  5  6  1  0  0  0  0
  5  9  1  0  0  0  0
  5 12  1  0  0  0  0
  6 10  1  0  0  0  0
  6 12  1  0  0  0  0
  7  8  1  0  0  0  0
  7 10  1  0  0  0  0
  7 11  1  0  0  0  0
  8  9  1  0  0  0  0
  8 11  1  0  0  0  0
  9 11  1  0  0  0  0
  9 12  1  0  0  0  0
 10 11  1  0  0  0  0
 10 12  1  0  0  0  0
 11 12  1  0  0  0  0
M  END




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-25 Thread Peter S. Shenkin
Well, I'm not really familiar with the Taylor-Butina clustering method, so
I'm proposing a methodology based on generalizing something that I found to
be useful in a somewhat different clustering context.

Presuming that what you are clustering is the fingerprints of structures,
and that you know which structures are in each cluster, you'd compute the
average of all the fingerprints. That is, each bit position would be given
a floating point number that is the average of the 0s and 1s at that
position computed over the structures in the cluster.  Then you'd compute
the distance (say, Manhattan or Euclidean) between the fingerprint of each
structure in the cluster and the average so computed. The "most
representative structure" would be the cluster member whose fingerprint is
closest to the cluster's average fingerprint. (Some additional mileage
could be gained by seeing just how far away from the average the "most
representative structures" are. It might be more representative (i.e.,
closer) for some clusters than for others.)
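
The averaging idea described above can be sketched in a few lines of plain
Python. This is only an illustration under stated assumptions: fingerprints
are taken to be equal-length lists of 0/1 bits, the Manhattan distance is
used, and the function name most_representative is made up for this example
(it is not part of RDKit or any other toolkit).

```python
def most_representative(fingerprints):
    """Return (index, distance) of the cluster member whose fingerprint
    is closest (Manhattan distance) to the cluster's average fingerprint.

    `fingerprints` is assumed to be a non-empty list of equal-length
    lists of 0/1 bits, one list per structure in the cluster.
    """
    n = len(fingerprints)
    n_bits = len(fingerprints[0])
    # Per-position average of the 0s and 1s over the whole cluster.
    mean = [sum(fp[i] for fp in fingerprints) / n for i in range(n_bits)]

    def manhattan(fp):
        return sum(abs(bit - m) for bit, m in zip(fp, mean))

    distances = [manhattan(fp) for fp in fingerprints]
    best = min(range(n), key=distances.__getitem__)
    return best, distances[best]


# Toy cluster of four 4-bit fingerprints (hypothetical data).
cluster = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
]
index, distance = most_representative(cluster)
print(index, distance)
```

The returned distance is the "how far from the average" value mentioned
above, so it can be compared across clusters.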

It would make sense to try this (since it's easy enough) and see whether
the resulting "most representative structures" from the clusters really are
at least roughly representative, by comparing them with viewable random
subsets of structures from the clusters.

-P.

On Tue, Sep 25, 2018 at 2:36 PM, Andrew Dalke wrote:

> On Sep 25, 2018, at 17:13, Peter S. Shenkin  wrote:
> > FWIW, in work on conformational clustering, I used the “most
> representative” molecule; that is, the real molecule closest to the
> mathematical centroid. This would probably be the best way of displaying a
> single molecule that typifies what is in the cluster.
>
> In some sense I'm rephrasing Chris Earnshaw's earlier question - how does
> one do that with Taylor-Butina clustering? And does it make sense?
>
> The algorithm starts by picking a centroid based on the fingerprints with
> the highest number of neighbors, so none of the other cluster members
> should have more neighbors within that cutoff.
>
> I am far from an expert on this topic, but any alternative I can think of
> makes me think I should have started with something other than
> Taylor-Butina.
>
>
>
> Andrew
> da...@dalkescientific.com


Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-25 Thread Andrew Dalke
On Sep 25, 2018, at 17:13, Peter S. Shenkin  wrote:
> FWIW, in work on conformational clustering, I used the “most representative” 
> molecule; that is, the real molecule closest to the mathematical centroid. 
> This would probably be the best way of displaying a single molecule that 
> typifies what is in the cluster. 

In some sense I'm rephrasing Chris Earnshaw's earlier question - how does one 
do that with Taylor-Butina clustering? And does it make sense?

The algorithm starts by picking a centroid based on the fingerprints with the 
highest number of neighbors, so none of the other cluster members should have 
more neighbors within that cutoff.
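
To make the point about the seed concrete, here is a simplified sketch of
Taylor-Butina clustering in plain Python. It is not the chemfp or RDKit
implementation; fingerprints are assumed to be sets of on-bit positions, the
neighbor counts are computed once up front, and the function names are
chosen only for this example.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def taylor_butina(fps, threshold):
    """Return a list of (seed_index, member_indices) tuples.

    Candidates are visited in order of decreasing neighbor count, so each
    seed has at least as many neighbors as any member it absorbs.
    """
    n = len(fps)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fps[i], fps[j]) >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)

    # Most neighbors first; ties are broken by the lower index.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for seed in order:
        if seed in assigned:
            continue
        members = sorted(neighbors[seed] - assigned)
        assigned.add(seed)
        assigned.update(members)
        clusters.append((seed, members))
    return clusters


# Toy fingerprints (hypothetical data).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}, {7, 8, 9}, {7, 8, 10}, {20}]
print(taylor_butina(fps, threshold=0.5))
```

Because the seed is chosen by neighbor count rather than by position in
fingerprint space, it need not coincide with the member closest to the
mathematical centroid.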

I am far from an expert on this topic, but any alternative I can think of 
makes me think I should have started with something other than Taylor-Butina.



Andrew
da...@dalkescientific.com






Re: [Rdkit-discuss] Saving mol file

2018-09-25 Thread Colin Bournez
Well yes, I do have this line; I did not include the whole file, for 
clarity. The thing is that tools such as MOE and PyMOL read it without 
problems, but rDock, for example, can't read it properly and returns a 
neutral N, which is not correct. If I open it with PyMOL and save it 
back in mol format, the 3 appears on the N line and rDock has no trouble 
anymore...
I was just wondering if there was a trick in RDKit to also save it this 
way.
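
One possible workaround, sketched here in plain Python rather than through
any RDKit option, is to post-process the mol block text and copy the charges
from the "M  CHG" lines into the charge field of the atom block. The helper
name copy_chg_to_atom_block is hypothetical, and the sketch assumes a
well-formed V2000 mol block. In the V2000 atom block the charge field uses
the codes 1 = +3, 2 = +2, 3 = +1, 5 = -1, 6 = -2, 7 = -3.

```python
# V2000 atom-block charge codes for each formal charge.
CHARGE_TO_CODE = {3: 1, 2: 2, 1: 3, -1: 5, -2: 6, -3: 7}


def copy_chg_to_atom_block(molblock):
    """Copy charges from 'M  CHG' lines into the atom-block charge field.

    Assumes a well-formed V2000 mol block: three header lines, the counts
    line, then one fixed-width line per atom.
    """
    lines = molblock.split("\n")
    n_atoms = int(lines[3][0:3])

    # Collect {atom number (1-based): formal charge} from the M  CHG lines.
    charges = {}
    for line in lines:
        if line.startswith("M  CHG"):
            fields = line.split()
            for i in range(int(fields[2])):
                charges[int(fields[3 + 2 * i])] = int(fields[4 + 2 * i])

    for atom_number, charge in charges.items():
        code = CHARGE_TO_CODE.get(charge)
        if code is None or not 1 <= atom_number <= n_atoms:
            continue  # charge not representable in the atom block
        row = 3 + atom_number  # atom 1 sits on the line after the counts line
        line = lines[row].ljust(39)
        # The charge field occupies columns 37-39 (0-based slice 36:39).
        lines[row] = line[:36] + "%3d" % code + line[39:]
    return "\n".join(lines)


EXAMPLE = "\n".join([
    "",
    "     RDKit          2D",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.2990    0.7500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  CHG  1   2   1",
    "M  END",
])
print(copy_chg_to_atom_block(EXAMPLE))
```

The "M  CHG" line is left in place, so tools that read either representation
should see the same charge. The input text could come from
Chem.MolToMolBlock(), with the result written to disk by hand.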



On 25/09/18 17:18, Greg Landrum wrote:

Hi Colin,
The RDkit outputs charge information to mol blocks using the CHG line:

In [3]: m = Chem.MolFromSmiles('C[NH3+]')

In [4]: print(Chem.MolToMolBlock(m))

     RDKit          2D

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2990    0.7500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
M  CHG  1   2   1
M  END


I expect that you will find one of those in your mol file and that it 
should be properly read in by other tools.

Is this not the case for you?

Best,
-greg



On Tue, Sep 25, 2018 at 4:39 PM Colin Bournez 
<colin.bour...@univ-orleans.fr> wrote:


Hey everyone,

I have a question concerning the Chem.MolToMolFile() function.
When I open this file containing an N+ (here is the corresponding line
in the mol file):

   11.3700    3.4360  -11.8300 N   0  3  0  0  0  0  0  0  0  0  0  0

and just save it back without any modification, the line becomes:

   11.3700    3.4360  -11.8300 N   0  0  0  0  0  0  0  0  0  0  0  0

The problem is that for some software this mol file causes trouble
and the N+ is transformed to N with 4 bonds.
I tried several tricks but I was not able to save it as the
original line. Does anyone have a suggestion?

Thanks,

-- 
*Colin Bournez*

PhD Student, Structural Bioinformatics & Chemoinformatics
Institut de Chimie Organique et Analytique (ICOA), UMR
CNRS-Université d'Orléans 7311
Rue de Chartres, 45067 Orléans, France
T. +33 238 494 577



--
*Colin Bournez*
PhD Student, Structural Bioinformatics & Chemoinformatics
Institut de Chimie Organique et Analytique (ICOA), UMR CNRS-Université 
d'Orléans 7311

Rue de Chartres, 45067 Orléans, France
T. +33 238 494 577


Re: [Rdkit-discuss] Saving mol file

2018-09-25 Thread Greg Landrum
Hi Colin,
The RDkit outputs charge information to mol blocks using the CHG line:

In [3]: m = Chem.MolFromSmiles('C[NH3+]')

In [4]: print(Chem.MolToMolBlock(m))

     RDKit          2D

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2990    0.7500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
M  CHG  1   2   1
M  END


I expect that you will find one of those in your mol file and that it
should be properly read in by other tools.
Is this not the case for you?

Best,
-greg



On Tue, Sep 25, 2018 at 4:39 PM Colin Bournez wrote:

> Hey everyone,
>
> I have a question concerning the Chem.MolToMolFile() function.
> When I open this file containing an N+ (here is the corresponding line in
> the mol file):
>
>    11.3700    3.4360  -11.8300 N   0  3  0  0  0  0  0  0  0  0  0  0
>
> and just save it back without any modification, the line becomes:
>
>    11.3700    3.4360  -11.8300 N   0  0  0  0  0  0  0  0  0  0  0  0
>
> The problem is that for some software this mol file causes trouble and the
> N+ is transformed to N with 4 bonds.
> I tried several tricks but I was not able to save it as the original line.
> Does anyone have a suggestion?
>
> Thanks,
>
> --
> *Colin Bournez*
> PhD Student, Structural Bioinformatics & Chemoinformatics
> Institut de Chimie Organique et Analytique (ICOA), UMR CNRS-Université
> d'Orléans 7311
> Rue de Chartres, 45067 Orléans, France
> T. +33 238 494 577


Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-25 Thread Peter S. Shenkin
(I see that I accidentally responded to Andrew, only, earlier; I'm copying
to the group this time.)

FWIW, in work on conformational clustering, I used the “most
representative” molecule; that is, the real molecule closest to the
mathematical centroid. This would probably be the best way of displaying a
single molecule that typifies what is in the cluster.

-P.

On Tue, Sep 25, 2018 at 8:09 AM, Andrew Dalke wrote:

> On Sep 21, 2018, at 14:53, Philipp Thiel wrote:
> > you probably read about the Tanimoto being a proper metric in case of
> having binary data
> > in Leach and Gillet 'Introduction to Chemoinformatics' chapter 5.3.1 in
> the revised edition.
>
> What we call Tanimoto is more broadly known as the Jaccard. Various sites
> demonstrate that the Jaccard distance = 1-Jaccard = 1-Tanimoto is a metric,
> such as
> https://mathoverflow.net/questions/18084/is-the-jaccard-distance-a-distance
> and https://arxiv.org/abs/1612.02696 .
>
> Going back to James T. Metz's original question, one alternative might be
> to use chemfp and the Taylor-Butina clustering implementation available at:
>
>   http://dalkescientific.com/writings/taylor_butina.py
>
> Following Dave Cosgrove's advice:
>
> > I expect James means what we used to call the cluster seed, i.e. the
> molecule the cluster was based on, rather than the mathematical centroid.
> Calculating distances from each cluster member to that would be quite
> straightforward as a post-processing step although that would roughly
> double the time taken.
>
> it's possible to change the reporting code from:
>
>     for centroid_idx, members in clusters:
>         print(arena.ids[centroid_idx], "has", len(members), "other members",
>               file=outfile)
>         print("=>", " ".join(arena.ids[idx] for idx in members),
>               file=outfile)
>
> so it does the post-processing:
>
>     print(len(clusters), "clusters", file=outfile)
>     for centroid_idx, members in clusters:
>         print(arena.ids[centroid_idx], "has", len(members), "other members",
>               file=outfile)
>         subarena = arena.copy(indices=members)
>         centroid_fp = arena.get_fingerprint(centroid_idx)
>         result = subarena.threshold_tanimoto_search_fp(centroid_fp,
>                                                        threshold=0.0)
>         result.reorder()  # sort so the highest scores come first
>         for id, score in result.get_ids_and_scores():
>             print("=>", id, "score:", score)
>
>
> Cheers,
>
> Andrew
> da...@dalkescientific.com


[Rdkit-discuss] Saving mol file

2018-09-25 Thread Colin Bournez

Hey everyone,

I have a question concerning the Chem.MolToMolFile() function.
When I open this file containing an N+ (here is the corresponding line in 
the mol file):

   11.3700    3.4360  -11.8300 N   0  3  0  0  0  0  0  0  0  0  0  0

and just save it back without any modification, the line becomes:

   11.3700    3.4360  -11.8300 N   0  0  0  0  0  0  0  0  0  0  0  0

The problem is that for some software this mol file causes trouble and 
the N+ is transformed to N with 4 bonds.
I tried several tricks but I was not able to save it as the original 
line. Does anyone have a suggestion?


Thanks,

--
*Colin Bournez*
PhD Student, Structural Bioinformatics & Chemoinformatics
Institut de Chimie Organique et Analytique (ICOA), UMR CNRS-Université 
d'Orléans 7311

Rue de Chartres, 45067 Orléans, France
T. +33 238 494 577


Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-25 Thread Andrew Dalke
On Sep 21, 2018, at 14:53, Philipp Thiel wrote:
> you probably read about the Tanimoto being a proper metric in case of having 
> binary data
> in Leach and Gillet 'Introduction to Chemoinformatics' chapter 5.3.1 in the 
> revised edition.

What we call Tanimoto is more broadly known as the Jaccard. Various sites 
demonstrate that the Jaccard distance = 1-Jaccard = 1-Tanimoto is a metric, 
such as 
https://mathoverflow.net/questions/18084/is-the-jaccard-distance-a-distance and 
https://arxiv.org/abs/1612.02696 .
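
As a small numerical illustration of that claim (a spot check on toy data,
not a proof), the sketch below computes the Jaccard distance between sets of
on-bit positions and tests the triangle inequality for every ordered triple.
The function names and the example sets are made up for this illustration.

```python
from itertools import permutations


def jaccard_distance(a, b):
    """1 - Tanimoto for two sets of on-bit positions."""
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / len(union)


def triangle_inequality_holds(sets, tol=1e-12):
    """Check d(x, z) <= d(x, y) + d(y, z) for every ordered triple."""
    return all(
        jaccard_distance(x, z)
        <= jaccard_distance(x, y) + jaccard_distance(y, z) + tol
        for x, y, z in permutations(sets, 3)
    )


# Toy fingerprints (hypothetical data).
examples = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {1, 5}, {6}]
print(jaccard_distance(examples[0], examples[1]))
print(triangle_inequality_holds(examples))
```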

Going back to James T. Metz's original question, one alternative might be to 
use chemfp and the Taylor-Butina clustering implementation available at: 

  http://dalkescientific.com/writings/taylor_butina.py

Following Dave Cosgrove's advice: 

> I expect James means what we used to call the cluster seed, i.e. the molecule 
> the cluster was based on, rather than the mathematical centroid. Calculating 
> distances from each cluster member to that would be quite straightforward as 
> a post-processing step although that would roughly double the time taken. 

it's possible to change the reporting code from:

    for centroid_idx, members in clusters:
        print(arena.ids[centroid_idx], "has", len(members), "other members",
              file=outfile)
        print("=>", " ".join(arena.ids[idx] for idx in members), file=outfile)

so it does the post-processing:

    print(len(clusters), "clusters", file=outfile)
    for centroid_idx, members in clusters:
        print(arena.ids[centroid_idx], "has", len(members), "other members",
              file=outfile)
        subarena = arena.copy(indices=members)
        centroid_fp = arena.get_fingerprint(centroid_idx)
        result = subarena.threshold_tanimoto_search_fp(centroid_fp,
                                                       threshold=0.0)
        result.reorder()  # sort so the highest scores come first
        for id, score in result.get_ids_and_scores():
            print("=>", id, "score:", score)
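
For readers without chemfp at hand, the same post-processing step can be
sketched in plain Python. This is an illustration only: fingerprints are
assumed to be sets of on-bit positions, clusters are assumed to be
(seed_index, member_indices) pairs, and score_members is a name invented
for this example rather than a chemfp function.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def score_members(fps, clusters):
    """For each cluster, score every member against the cluster seed.

    Returns a list of (seed_index, [(member_index, score), ...]) with
    the highest-scoring members first.
    """
    report = []
    for seed, members in clusters:
        scored = [(idx, tanimoto(fps[seed], fps[idx])) for idx in members]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        report.append((seed, scored))
    return report


# Toy fingerprints and clusters (hypothetical data).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}, {7, 8, 9}, {7, 8, 10}]
clusters = [(0, [1, 2]), (3, [4])]
for seed, scored in score_members(fps, clusters):
    print(seed, "has", len(scored), "other members")
    for idx, score in scored:
        print("=>", idx, "score:", score)
```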


Cheers,

Andrew
da...@dalkescientific.com






[Rdkit-discuss] Docker image for GSOC2018_MolVS_Integration

2018-09-25 Thread Tim Dudgeon
I was very happy to hear about the integration of MolVS into RDKit core 
in the talk by Susan Leung at the recent UGM.


https://github.com/rdkit/UGM_2018/blob/master/Presentations/Leung_GSoC_RDKit-MolVS_Integration.pdf
This is going to be incredibly useful once it gets released.

To help with testing of this I have created a Docker image based on the 
code on Susan's fork 
(https://github.com/susanhleung/rdkit/tree/dev/GSOC2018_MolVS_Integration) 
which I believe is what is used for the PR on the RDKit repo 
(https://github.com/rdkit/rdkit/pull/2002).


I will try to keep this updated at suitable intervals until the code is 
merged into the main RDKit repo.


To run the Docker image try something like this:

$ docker run -it --rm informaticsmatters/rdkit-python-debian:standardizer python

Python 2.7.15+ (default, Aug 31 2018, 11:56:52)
[GCC 8.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from rdkit import Chem
>>> from rdkit.Chem.MolStandardize import rdMolStandardize
>>> m = rdMolStandardize.StandardizeSmiles('[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1')

[09:18:41] Initializing MetalDisconnector
[09:18:41] Running MetalDisconnector
[09:18:41] Removed covalent bond between Na and O
[09:18:41] Initializing Normalizer
[09:18:41] Running Normalizer
[09:18:41] Rule applied: SulfonetoS(=O)(=O)
>>> m
'O=C([O-])c1ccc(C[S](=O)=O)cc1.[Na+]'
>>>


