[Rdkit-discuss] Stable format for long-term storage

2018-10-05 Thread Eric Jonas
Hello! Is there a recommended stable format for long-term storage of RDKit
molecules? Will ToBinary() give me what I need? (the documentation /
purpose seems to be a bit... spartan)  I'd like to save topology,
conformers, and properties (at the atom, bond, and molecule level) to disk
in a format that is likely to persist across RDKit versions / other libraries.
I've had bad experiences historically with pickle, which is not recommended
for long-term storage.

Thanks,

...Eric
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDK5 fingerprint

2018-10-05 Thread Greg Landrum
I think Nils is right here. An RDKit fingerprint with a max path length of 12
is going to set A LOT of bits. Try it and see.
Collisions are almost guaranteed.
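
To put rough numbers on the collision concern, here is a minimal sketch that
treats the fingerprint as a balls-into-bins problem. It assumes that every
hashed path lands on a uniformly random bit, which is only an approximation of
what the RDKit hashing does, and the feature counts are made-up illustrations,
not measurements. The function names are hypothetical.

```python
def expected_bits_set(n_features, fp_size, bits_per_feature=1):
    """Expected number of distinct bits set when n_features hashed features
    each set bits_per_feature uniformly random bits in an fp_size-bit vector.
    (Uniform hashing is an assumption made for this sketch.)"""
    throws = n_features * bits_per_feature
    return fp_size * (1.0 - (1.0 - 1.0 / fp_size) ** throws)


def expected_lost_bits(n_features, fp_size):
    """Features that do not get a bit of their own because of collisions."""
    return n_features - expected_bits_set(n_features, fp_size)


if __name__ == "__main__":
    # Hypothetical feature counts: a short path length vs. a long one.
    for n in (100, 1000, 20000):
        for size in (2048, 4096, 8192):
            occupied = expected_bits_set(n, size)
            print(f"{n:6d} features, fpSize {size:5d}: "
                  f"{occupied / size:6.1%} of bits set, "
                  f"{expected_lost_bits(n, size):9.1f} lost to collisions")
```

While the number of features is small compared to fpSize, doubling fpSize
roughly halves the collisions. Once the feature count is far larger than
fpSize, nearly every bit is set and a factor of two in size no longer helps
much.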

There are many possible reasons why you may not be getting the results you
expect (that’s the fun in machine learning), but if you suspect that the
fingerprints are the problem, you might try another FP and see if you miss
the same compounds. If so: maybe it’s the data. If not: could be the
different info in the different FPs and you could try combining them. We
did a paper on this:
https://pubs.acs.org/doi/abs/10.1021/ci400466r

There are many things to try... one never runs out of new approaches. :-)

On Thu, 4 Oct 2018 at 21:06, Nils Weskamp wrote:

> On 04.10.2018 at 20:53, Thomas Evangelidis wrote:
> > not sure if significantly longer path lengths (e.g. 12) actually
> > "increase the amount of information" since they also increase the
> > risk of bit collisions in folded fingerprints.
> >
> > If you increase the fpSize to 8192, won't you reduce the risk of bit
> > collisions?
>
> Yes, by a factor of two. However, depending on the size and complexity
> of your compounds, I would expect that the number of bits grows
> significantly more (due to combinatorial explosion) when you go from
> path length 5 (or 7) to 12.
>
> Best,
> Nils
>
>


Re: [Rdkit-discuss] canonical atom mapping

2018-10-05 Thread Greg Landrum
Thanks Paolo,

That would be a really nice contribution to the RDKit Cookbook. hint hint.
:-)

-greg


On Fri, Oct 5, 2018 at 12:20 AM Paolo Tosco 
wrote:

> Hi Eric,
>
> I may be a bit late, but here's a gist that shows how to convert a
> molecule in XYZ format to an RDKit molecule with bond orders taken from
> SMILES retaining hydrogens if originally present.
>
> https://gist.github.com/ptosco/4844d3635cf14d11e5e14381993915c1
>
> Disclaimer: I have tested it on the single molecule that I used as example.
>
> Cheers,
> p.
>
> On 10/04/18 22:36, Eric Jonas wrote:
>
> Thanks both Patrick and Maria, this is incredibly helpful. Is there any
> easy way, inside RDKit, to go from a list of molecules to a PDB file?
> Right now I'm writing an .xyz file and doing a xyz->pdb pass via a
> command-line callout to openbabel, but that's a bit of a hack.
>
> On Thu, Oct 4, 2018 at 1:37 PM Patrick Walters 
> wrote:
>
>> I just wrote a blog post on this topic.
>>
>>
>> https://practicalcheminformatics.blogspot.com/2018/09/assigning-bond-orders-to-pdb-ligands.html
>>
>>
>> On Thu, Oct 4, 2018 at 3:35 PM MARIA BRANDL via Rdkit-discuss <
>> rdkit-discuss@lists.sourceforge.net> wrote:
>>
>>> Hello Eric,
>>>
>>> RDKit can assign bond orders from a smiles string to a PDB coordinate
>>> file.
>>> See chapter on 3D functionality in
>>> https://www.rdkit.org/docs/Cookbook.html.
>>>
>>> I assume that the coordinates need to have connectivity (but not bond
>>> order) information,
>>> which should not be too hard to compute as long as your coordinates are
>>> not too distorted.
>>>
>>> Hope this helps,
>>> Best wishes
>>>
>>> Maria Brandl
>>> On Thursday, 4 October 2018, 15:49:10 BST, Eric Jonas <
>>> jo...@ericjonas.com> wrote:
>>>
>>>
>>> Hello! I have a large database of molecules where I have, for each
>>> molecule:
>>>
>>> 1. A geometry (element, x, y, z)
>>> 2. a SMILES string
>>>
>>> and I would like to get the associated Chem.Mol structure. Of course,
>>> there's no guarantee that the ordering of the atoms in the Mol will match
>>> up with the order in my geometry list. I'm trying to figure out the right
>>> way of turning this geometry into a valid conformation for my Mol. Is there
>>> any reliable way of doing this? I know that going strictly from
>>> geometry->Mol can be challenging (OBabel has some reasonable support, as
>>> does the Python library xyz2mol) but those tools seem to depend largely on
>>> bond-length heuristics, and it seems if I have the SMILES string it should
>>> be possible to do a better job.
>>>
>>> Thanks,
>>>
>>> ...Eric


Re: [Rdkit-discuss] number of significant digits in molblock?

2018-10-05 Thread Michal Krompiec
6 digits seems perfectly fine for me.

On Fri, 5 Oct 2018 at 14:26, Greg Landrum  wrote:

>
> On Fri, Oct 5, 2018 at 2:42 PM Ivan Tubert-Brohman <
> ivan.tubert-broh...@schrodinger.com> wrote:
>
>> In the newer "V3000", the atom line is not column-based, which I believe
>> gives more freedom to implementers to decide the precision of the
>> coordinates. You can force RDKit to write in this format by calling
>> SetForceV3000(True) on your writer object. I tried it and I get 5 digits
>> after the decimal point instead of 4, so at least that's a start. Looking
>> at the RDKit code (function GetV3000MolFileAtomLine), it just writes the
>> coordinates without setting the precision, so what you get is the default
>> stringstream conversion. Here's where one could in principle adjust this
>> precision, but there's clearly no API to do so at the moment.
>>
>
> Yep. This is not currently possible without editing C++ code.
> If there is a real use case for having more than 6 sig figs for atomic
> positions (this is what is currently available), we can certainly come up
> with a way to make it happen. I don't recall having seen any real-world
> examples where that would be desirable.
>
>


Re: [Rdkit-discuss] number of significant digits in molblock?

2018-10-05 Thread Greg Landrum
On Fri, Oct 5, 2018 at 2:42 PM Ivan Tubert-Brohman <
ivan.tubert-broh...@schrodinger.com> wrote:

> In the newer "V3000", the atom line is not column-based, which I believe
> gives more freedom to implementers to decide the precision of the
> coordinates. You can force RDKit to write in this format by calling
> SetForceV3000(True) on your writer object. I tried it and I get 5 digits
> after the decimal point instead of 4, so at least that's a start. Looking
> at the RDKit code (function GetV3000MolFileAtomLine), it just writes the
> coordinates without setting the precision, so what you get is the default
> stringstream conversion. Here's where one could in principle adjust this
> precision, but there's clearly no API to do so at the moment.
>

Yep. This is not currently possible without editing C++ code.
If there is a real use case for having more than 6 sig figs for atomic
positions (this is what is currently available), we can certainly come up
with a way to make it happen. I don't recall having seen any real-world
examples where that would be desirable.


Re: [Rdkit-discuss] number of significant digits in molblock?

2018-10-05 Thread Ivan Tubert-Brohman
Hi Michal,

The old SDF format (aka V2000 CTAB) is column-based, as things often were
in the era of Fortran 77 and punch cards. Not only the precision but also
the exact position of each value on the line is specified! Here's what the
spec says:

The Atom Block is made up of atom lines, one line per atom with the
following format:

xxxxx.xxxxyyyyy.yyyyzzzzz.zzzz aaaddcccssshhhbbbvvvHHHrrriiimmmnnneee

which explains why you see four digits after the decimal point. Also note
that in a huge blow to readability, no spaces are required between the
coordinates; if you have coordinates with five digits before the decimal
point, the numbers run into each other, and if you have even more digits,
the number doesn't even fit! There are also limits on the number of atoms
for similar reasons. But I digress...
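
To make the column problem concrete, here is a small sketch in plain Python
(no RDKit needed). It assumes the coordinate fields are each 10 characters
wide with 4 decimals, which is my reading of the atom line layout quoted
above; the helper name v2000_xyz is made up for this example.

```python
def v2000_xyz(x, y, z):
    """Format coordinates as three adjacent fixed-width fields
    (10 characters, 4 decimals each), with no separator between them."""
    return f"{x:10.4f}{y:10.4f}{z:10.4f}"


if __name__ == "__main__":
    # Ordinary coordinates: padding keeps the columns readable.
    print(repr(v2000_xyz(0.123456789, 0.2, 0.3)))

    # Five digits before the decimal point: each field is completely
    # filled, so the three numbers run into each other.
    print(repr(v2000_xyz(12345.6789, 12345.6789, 12345.6789)))

    # Six digits before the decimal point: the value needs 11 characters
    # and no longer fits its 10-character field.
    print(len(f"{123456.789:10.4f}"))
```

The first case also shows where the four decimals come from: the ninth
significant digit of 0.123456789 is simply rounded away by the field format.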

In the newer "V3000", the atom line is not column-based, which I believe
gives more freedom to implementers to decide the precision of the
coordinates. You can force RDKit to write in this format by calling
SetForceV3000(True) on your writer object. I tried it and I get 5 digits
after the decimal point instead of 4, so at least that's a start. Looking
at the RDKit code (function GetV3000MolFileAtomLine), it just writes the
coordinates without setting the precision, so what you get is the default
stringstream conversion. Here's where one could in principle adjust this
precision, but there's clearly no API to do so at the moment.
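
For a feel of what "default stringstream conversion" means without building
the C++ code: a default-constructed C++ stream prints doubles with 6
significant digits in general notation, which matches printf-style "%g".
The sketch below imitates that in plain Python. It is an illustration of the
formatting rule only, not the RDKit writer, and the helper names are
hypothetical.

```python
def default_stream_format(value):
    """Mimic C++ default ostream output for a double:
    6 significant digits, general notation (same as "%g")."""
    return "%g" % value


def stream_format_with_precision(value, digits):
    """What the output would look like if the stream precision were raised."""
    return "%.*g" % (digits, value)


if __name__ == "__main__":
    for v in (0.123456789, 0.2, 0.75, -5.551115123125783e-17):
        print(default_stream_format(v))

    # With 10 significant digits the full input value survives.
    print(stream_format_with_precision(0.123456789, 10))
```

This is significant digits rather than digits after the decimal point, so a
coordinate like 123.456789 would keep only three decimals under the same rule.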

Hope this helps,
Ivan


On Fri, Oct 5, 2018 at 5:44 AM Michal Krompiec 
wrote:

> Hello,
> Is it possible to control the number of significant digits of XYZ
> coordinates? I am modifying coordinates of my molecules
> using SetAtomPosition but when I save them into an SDF it seems that the
> precision is limited to 4 digits after the decimal point (I'd like 10
> instead...).
> Best wishes,
> Michal


Re: [Rdkit-discuss] number of significant digits in molblock?

2018-10-05 Thread Michal Krompiec
Hi Jan,
Thanks, 6 digits is OK! Forcing V3000 did the trick:
sdf_out=Chem.SDWriter(outfile)
sdf_out.SetForceV3000(True)

Best,
Michal

On Fri, 5 Oct 2018 at 12:59, Jan Holst Jensen  wrote:

> Hi Michal,
>
> V2000 format is restricted by its specification to fixed format with 4
> decimals. V3000 output is not restricted to a fixed format, but the current
> code still rounds it in practice as seen below.
>
> To get extra precision you could change the formatting of x, y, and z
> coordinate output in Code/GraphMol/FileParsers/MolFileWriter.cpp, function
> GetV3000MolFileAtomLine(), look for the
>
> ss << " " << x << " " << y << " " << z;
>
> line. Adding extra digits to the X, Y, and Z coordinates *should* not
> cause issues for compliant V3000 readers.
>
> Cheers
> -- Jan
>
> >>> import rdkit
> >>> from rdkit import Chem
> >>> from rdkit.Chem import AllChem
> >>> m = Chem.MolFromSmiles('CC')
> >>> AllChem.Compute2DCoords(m)
> 0
> >>> m.GetConformer(0).SetAtomPosition(0,
> ...     rdkit.Geometry.Point3D(0.123456789, 0.2, 0.3))
> >>> print(Chem.MolToMolBlock(m))
>  RDKit  2D
>
>   2  1  0  0  0  0  0  0  0  0999 V2000
>     0.1235    0.2000    0.3000 C   0  0  0  0  0  0  0  0  0  0  0  0    <== 4 decimal digits
>     0.7500   -0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>   1  2  1  0
> M  END
>
> >>> print(Chem.MolToMolBlock(m, forceV3000=True))
>
>  RDKit  2D
>
>   0  0  0  0  0  0  0  0  0  0999 V3000
> M  V30 BEGIN CTAB
> M  V30 COUNTS 2 1 0 0 0
> M  V30 BEGIN ATOM
> M  V30 1 C 0.123457 0.2 0.3 0    <== 6 decimal digits
> M  V30 2 C 0.75 -5.55112e-17 0 0
> M  V30 END ATOM
> M  V30 BEGIN BOND
> M  V30 1 1 1 2
> M  V30 END BOND
> M  V30 END CTAB
> M  END
>
> >>>
>
> On 2018-10-05 11:42, Michal Krompiec wrote:
>
> Hello,
> Is it possible to control the number of significant digits of XYZ
> coordinates? I am modifying coordinates of my molecules
> using SetAtomPosition but when I save them into an SDF it seems that the
> precision is limited to 4 digits after the decimal point (I'd like 10
> instead...).
> Best wishes,
> Michal
>
>
>


Re: [Rdkit-discuss] number of significant digits in molblock?

2018-10-05 Thread Jan Holst Jensen

Hi Michal,

V2000 format is restricted by its specification to fixed format with 4 
decimals. V3000 output is not restricted to a fixed format, but the 
current code still rounds it in practice as seen below.


To get extra precision you could change the formatting of x, y, and z 
coordinate output in Code/GraphMol/FileParsers/MolFileWriter.cpp, 
function GetV3000MolFileAtomLine(), look for the


    ss << " " << x << " " << y << " " << z;

line. Adding extra digits to the X, Y, and Z coordinates *should* not 
cause issues for compliant V3000 readers.


Cheers
-- Jan

>>> import rdkit
>>> from rdkit import Chem
>>> from rdkit.Chem import AllChem
>>> m = Chem.MolFromSmiles('CC')
>>> AllChem.Compute2DCoords(m)
0
>>> m.GetConformer(0).SetAtomPosition(0,
...     rdkit.Geometry.Point3D(0.123456789, 0.2, 0.3))

>>> print(Chem.MolToMolBlock(m))
 RDKit  2D

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.1235    0.2000    0.3000 C   0  0  0  0  0  0  0  0  0  0  0  0    <== 4 decimal digits
    0.7500   -0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
M  END

>>> print(Chem.MolToMolBlock(m, forceV3000=True))

 RDKit  2D

  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 2 1 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C 0.123457 0.2 0.3 0    <== 6 decimal digits
M  V30 2 C 0.75 -5.55112e-17 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 END BOND
M  V30 END CTAB
M  END

>>>

On 2018-10-05 11:42, Michal Krompiec wrote:

Hello,
Is it possible to control the number of significant digits of XYZ 
coordinates? I am modifying coordinates of my molecules 
using SetAtomPosition but when I save them into an SDF it seems that 
the precision is limited to 4 digits after the decimal point (I'd like 
10 instead...).

Best wishes,
Michal






[Rdkit-discuss] number of significant digits in molblock?

2018-10-05 Thread Michal Krompiec
Hello,
Is it possible to control the number of significant digits of XYZ
coordinates? I am modifying coordinates of my molecules
using SetAtomPosition but when I save them into an SDF it seems that the
precision is limited to 4 digits after the decimal point (I'd like 10
instead...).
Best wishes,
Michal