[Rdkit-discuss] PMI API

2017-01-08 Thread Chris Earnshaw
Hi

A while ago I had a project which needed PMI descriptors (specifically NPR1
and NPR2) which were not available in the main branch of RDKit at the time.
At the time I used the fork by 'hahnda6' which provided the
calcPMIDescriptors() function, and this worked well. Now that PMI
descriptors are available in the main RDKit distrubution I thought I'd
rewrite my code to use the official version.

Building the new RDKit was no problem, but things went downhill shortly
after that. There's every chance that I've missed the relevant
documentation (I hope someone can point me in the right direction if so)
and done something stupid!

The issues are -
1) I can't find any documentation of the C++ API - the only reference to
PMI in the online RDKit documentation appears to be to the PMI.h file
2) Having written a program using the PMI[123] and/or NPR[12] functions, I
couldn't get it to compile until I added the  -DRDK_BUILD_DESCRIPTORS3D
directive -
g++ -o sdf_pmi_blob sdf_pmi.cpp -I/packages/rdkit/include/rdkit
-L/packages/rdkit/lib -lDescriptors -lGraphMol -lFileParsers
-Wno-deprecated -O2 -DRDK_BUILD_DESCRIPTORS3D
This seems a bit odd...
3) Is it necessary to make separate calls to the individual PMI() and/or
NPR() functions? Surely this results in duplication of some of the heavier
calculations? I can't find any equivalent of calcPMIDescriptors() which
returned a 'Moments' struct containing all the PMI and NPR values in one go.
4) The big one! The returned results look very odd. They appear to relate
more to the dimensions of the molecule than the moments of inertia. For a
rod-like molecule (dimethylacetylene) I'd expect two large and one small
PMI (e.g. PMI1: 6.61651   PMI2: 150.434   PMI3: 150.434  NPR1: 0.0439828
NPR2: 0.98) but actually get PMI1: 0.061647  PMI2: 0.061652  PMI3:
25.3699  NPR1: 0.002430  NPR2: 0.002430.
For disk-like (benzene) the result should be one large and two medium (e.g.
PMI1: 89.1448  PMI2: 89.1495  PMI3: 178.294  NPR1: 0.499987  NPR2:
0.500013) but get PMI1: 2.37457e-10  PMI2: 11.0844  PMI3: 11.0851  NPR1:
2.14213e-11  NPR2: 0.33.
Finally for a roughly spherical molecule (neopentane) the NPR values look
reasonable (no great surprise) but the absolute PMI values may be too
small: old program - PMI1: 114.795  PMI2: 114.797  PMI3: 114.799
NPR1: 0.66  NPR2: 0.88, new program - PMI1: 6.59466  PMI2: 6.59488
PMI3: 6.59531  NPR1: 0.02  NPR2: 0.35

As I say, it's entirely likely that I'm doing something stupid here so any
pointers will be gratefully received. FWIW, the core of my program is -
mol = MolBlockToMol(ctab, true, false);
double pmi1 = RDKit::Descriptors::PMI1(*mol);
double pmi2 = RDKit::Descriptors::PMI2(*mol);
double pmi3 = RDKit::Descriptors::PMI3(*mol);
double npr1 = RDKit::Descriptors::NPR1(*mol);
double npr2 = RDKit::Descriptors::NPR2(*mol);

Thanks for any help!
Chris
--
Check out the vibrant tech community on one of the world's most 
engaging tech sites, SlashDot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] PMI API

2017-01-08 Thread Chris Earnshaw
Hi David

Thanks for the rapid reply! Looks like a very useful document for people
getting started with the RDKit C++ API.

As you suspected, I'm slightly beyond that stage having been an RDKit user
for a number of years. My queries are specifically to do with using the PMI
functionality; most particularly why the numbers produced by the current
implementation don't appear to match expected values for particular shapes
of molecule, but also the lack of information about the PMI-related
functions in the main RDKit C++ API documentation and the apparently odd
requirement for the -DRDK_BUILD_DESCRIPTORS3D flag (looks more like a cmake
directive) when compiling a program which uses GraphMol/Descriptors3D
functions.

Cheers,
Chris

On 8 January 2017 at 12:13, David Cosgrove 
wrote:

> Hi Chris,
> I can help a bit with the first point - I am currently 'porting' the
> getting started in Python bit of the documentation to c++. There's a long
> way to go, but if you go to my fork of RDKit at https://github.com/
> DavidACosgrove and check out the GetStartedC++ branch, you can at least
> use what I've managed so far (https://github.com/
> DavidACosgrove/rdkit/blob/GetStartedC%2B%2B/Docs/Book/
> GettingStartedInC%2B%2B.md).  It's pretty basic stuff that you may
> already be beyond, but there are some examples and a CMakeLists.txt file
> that builds them which might be helpful.
>
>
> It's probably time I tidied it up (having just looked at it to get the
> link above, I see there's a typo on the first sentence, for example!) and
> sent in an interim Pull Request as for people starting out it might already
> be of value.
>
> Cheers,
> Dave
>
> On Sun, 8 Jan 2017 at 10:19, Chris Earnshaw  wrote:
>
>> Hi
>>
>> A while ago I had a project which needed PMI
>>
>> descriptors (specifically NPR1 and NPR2) which were not available in the
>>
>> main branch of RDKit at the time. At the time I used the fork by
>>
>> 'hahnda6' which provided the calcPMIDescriptors() function, and this
>>
>> worked well. Now that PMI descriptors are available in the main RDKit
>>
>> distrubution I thought I'd rewrite my code to use the official version.
>>
>> Building
>>
>> the new RDKit was no problem, but things went downhill shortly after
>>
>> that. There's every chance that I've missed the relevant documentation
>>
>> (I hope someone can point me in the right direction if so) and done
>>
>> something stupid!
>>
>> The issues are -
>> 1) I can't find
>>
>> any documentation of the C++ API - the only reference to PMI in the
>>
>> online RDKit documentation appears to be to the PMI.h file
>> 2)
>>
>> Having written a program using the PMI[123] and/or NPR[12] functions, I
>>
>> couldn't get it to compile until I added the  -DRDK_BUILD_DESCRIPTORS3D
>>
>> directive -
>> g++ -o sdf_pmi_blob sdf_pmi.cpp -I/packages/rdkit/include/rdkit
>> -L/packages/rdkit/lib -lDescriptors -lGraphMol -lFileParsers
>> -Wno-deprecated -O2 -DRDK_BUILD_DESCRIPTORS3D
>> This seems a bit odd...
>> 3)
>>
>> Is it necessary to make separate calls to the individual PMI() and/or
>>
>> NPR() functions? Surely this results in duplication of some of the
>>
>> heavier calculations? I can't find any equivalent of
>>
>> calcPMIDescriptors() which returned a 'Moments' struct containing all
>>
>> the PMI and NPR values in one go.
>> 4) The big one! The
>>
>> returned results look very odd. They appear to relate more to the
>>
>> dimensions of the molecule than the moments of inertia. For a rod-like
>>
>> molecule (dimethylacetylene) I'd expect two large and one small PMI
>>
>> (e.g. PMI1: 6.61651   PMI2: 150.434   PMI3: 150.434  NPR1: 0.0439828
>>
>> NPR2: 0.98) but actually get PMI1: 0.061647  PMI2: 0.061652  PMI3:
>>
>> 25.3699  NPR1: 0.002430  NPR2: 0.002430.
>> For disk-like (benzene) the
>>
>> result should be one large and two medium (e.g. PMI1: 89.1448  PMI2:
>>
>> 89.1495  PMI3: 178.294  NPR1: 0.499987  NPR2: 0.500013) but get PMI1:
>>
>> 2.37457e-10  PMI2: 11.0844  PMI3: 11.0851  NPR1: 2.14213e-11  NPR2:
>>
>> 0.33.
>> Finally for a roughly spherical molecule (neopentane) the
>>
>> NPR values look reasonable (no great surprise) but the absolute PMI
>>
>> values may be too small: old program - PMI1: 114.795  PMI2: 114.797
>>
>> PMI3: 114.799
>> NPR1: 0.66  NPR2: 0.88, new program - PMI1: 6.59466  PMI2:
>> 6.59

Re: [Rdkit-discuss] PMI API

2017-01-08 Thread Chris Earnshaw
Hi Brian & Greg

Many thanks for the replies. I built RDKit with Descriptors3D enabled
without any problems, it was working out how to tell the compiler to
process my source code using the new functions which was troublesome. It
would be very helpful if the need for the -DRDK_BUILD_DESCRIPTORS3D
compiler directive was documented, e.g. with a comment near the top of
PMI.h, at least until a better solution is in place.

Good to know that the expensive calculation is only done once. Hope it
won't be difficult to sort out the strange PMI & NPR values - please let me
kbow if you need any more information from me.

Chris Earnshaw


On 8 Jan 2017 18:17, "Brian Kelley"  wrote:

I think the relevant issue is that if you are using an existing build, we
don't yet have the capability for you to know what was built and what was
not.  I.e. You need to add the compiler flag to indicate that the 3D stuff
was actually built.

I had a PR to fix this a while ago that was postponed that we should
probably resurrect.  Basically it is an rdkit.h header file that has these
flags built in so you won't have to include them yourself.


Brian Kelley

On Jan 8, 2017, at 11:31 AM, Greg Landrum  wrote:

Hi Chris,

The RDKit should automatically build with the new descriptors enabled if
eigen3 can be found when cmake is run. When you run cmake you should see a
message if/when the build is disabled.

If you want to call the functions, the best documentation available is the
standard C++ API documentation, but something seems to have gone wrong when
I ran doxygen. I'll look into this. That documentation is generated from
the header file, so you can just look there:
https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Descriptors/PMI.h
not that there's a huge amount of documentation available.

W.r.t. efficiency: you do need to call the functions individually, but the
expensive calculation of the moments will only be done once, so it doesn't
end up doing repeated work.

And, finally, on the values themselves: I will have to take a look at that.
-greg




On Sun, Jan 8, 2017 at 11:17 AM, Chris Earnshaw 
wrote:

> Hi
>
> A while ago I had a project which needed PMI descriptors (specifically
> NPR1 and NPR2) which were not available in the main branch of RDKit at the
> time. At the time I used the fork by 'hahnda6' which provided the
> calcPMIDescriptors() function, and this worked well. Now that PMI
> descriptors are available in the main RDKit distrubution I thought I'd
> rewrite my code to use the official version.
>
> Building the new RDKit was no problem, but things went downhill shortly
> after that. There's every chance that I've missed the relevant
> documentation (I hope someone can point me in the right direction if so)
> and done something stupid!
>
> The issues are -
> 1) I can't find any documentation of the C++ API - the only reference to
> PMI in the online RDKit documentation appears to be to the PMI.h file
> 2) Having written a program using the PMI[123] and/or NPR[12] functions, I
> couldn't get it to compile until I added the  -DRDK_BUILD_DESCRIPTORS3D
> directive -
> g++ -o sdf_pmi_blob sdf_pmi.cpp -I/packages/rdkit/include/rdkit
> -L/packages/rdkit/lib -lDescriptors -lGraphMol -lFileParsers
> -Wno-deprecated -O2 -DRDK_BUILD_DESCRIPTORS3D
> This seems a bit odd...
> 3) Is it necessary to make separate calls to the individual PMI() and/or
> NPR() functions? Surely this results in duplication of some of the heavier
> calculations? I can't find any equivalent of calcPMIDescriptors() which
> returned a 'Moments' struct containing all the PMI and NPR values in one go.
> 4) The big one! The returned results look very odd. They appear to relate
> more to the dimensions of the molecule than the moments of inertia. For a
> rod-like molecule (dimethylacetylene) I'd expect two large and one small
> PMI (e.g. PMI1: 6.61651   PMI2: 150.434   PMI3: 150.434  NPR1: 0.0439828
> NPR2: 0.98) but actually get PMI1: 0.061647  PMI2: 0.061652  PMI3:
> 25.3699  NPR1: 0.002430  NPR2: 0.002430.
> For disk-like (benzene) the result should be one large and two medium
> (e.g. PMI1: 89.1448  PMI2: 89.1495  PMI3: 178.294  NPR1: 0.499987  NPR2:
> 0.500013) but get PMI1: 2.37457e-10  PMI2: 11.0844  PMI3: 11.0851  NPR1:
> 2.14213e-11  NPR2: 0.33.
> Finally for a roughly spherical molecule (neopentane) the NPR values look
> reasonable (no great surprise) but the absolute PMI values may be too
> small: old program - PMI1: 114.795  PMI2: 114.797  PMI3: 114.799
> NPR1: 0.66  NPR2: 0.88, new program - PMI1: 6.59466  PMI2:
> 6.59488  PMI3: 6.59531  NPR1: 0.02  NPR2: 0.35
>
> As I say, it's entirely likely that I'm doing something stupid here so any
> pointers will be gratefully 

Re: [Rdkit-discuss] PMI API

2017-01-13 Thread Chris Earnshaw
I can confirm that removing the conditional compilation directive
#ifdef RDK_BUILD_DESCRIPTORS3D
(and the corresponding #endif)
from PMI.h allows compilation without having to worry about the
-DRDK_BUILD_DESCRIPTORS3D directive. I think this is a worthwhile change.

Any thoughts about the numerical PMI and NPR values produced by the current
implementation? I still can't make any sense of them.

Chris

On 9 January 2017 at 05:18, Greg Landrum  wrote:

> A more straightforward solution to this one, and what I probably should
> have done in the first place, would be to not include the conditional
> compilation directives in the PMI.h header file. It should be fine to have
> the declarations in the header even if there is no corresponding
> definition, and then client code wouldn't need to know about the extra
> options.
>
> PR coming.
>
> On Sun, Jan 8, 2017 at 7:17 PM, Brian Kelley  wrote:
>
>> I think the relevant issue is that if you are using an existing build, we
>> don't yet have the capability for you to know what was built and what was
>> not.  I.e. You need to add the compiler flag to indicate that the 3D stuff
>> was actually built.
>>
>> I had a PR to fix this a while ago that was postponed that we should
>> probably resurrect.  Basically it is an rdkit.h header file that has these
>> flags built in so you won't have to include them yourself.
>>
>> 
>> Brian Kelley
>>
>> On Jan 8, 2017, at 11:31 AM, Greg Landrum  wrote:
>>
>> Hi Chris,
>>
>> The RDKit should automatically build with the new descriptors enabled if
>> eigen3 can be found when cmake is run. When you run cmake you should see a
>> message if/when the build is disabled.
>>
>> If you want to call the functions, the best documentation available is
>> the standard C++ API documentation, but something seems to have gone wrong
>> when I ran doxygen. I'll look into this. That documentation is generated
>> from the header file, so you can just look there:
>> https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/
>> Descriptors/PMI.h
>> not that there's a huge amount of documentation available.
>>
>> W.r.t. efficiency: you do need to call the functions individually, but
>> the expensive calculation of the moments will only be done once, so it
>> doesn't end up doing repeated work.
>>
>> And, finally, on the values themselves: I will have to take a look at
>> that.
>> -greg
>>
>>
>>
>>
>> On Sun, Jan 8, 2017 at 11:17 AM, Chris Earnshaw 
>> wrote:
>>
>>> Hi
>>>
>>> A while ago I had a project which needed PMI descriptors (specifically
>>> NPR1 and NPR2) which were not available in the main branch of RDKit at the
>>> time. At the time I used the fork by 'hahnda6' which provided the
>>> calcPMIDescriptors() function, and this worked well. Now that PMI
>>> descriptors are available in the main RDKit distrubution I thought I'd
>>> rewrite my code to use the official version.
>>>
>>> Building the new RDKit was no problem, but things went downhill shortly
>>> after that. There's every chance that I've missed the relevant
>>> documentation (I hope someone can point me in the right direction if so)
>>> and done something stupid!
>>>
>>> The issues are -
>>> 1) I can't find any documentation of the C++ API - the only reference to
>>> PMI in the online RDKit documentation appears to be to the PMI.h file
>>> 2) Having written a program using the PMI[123] and/or NPR[12] functions,
>>> I couldn't get it to compile until I added the  -DRDK_BUILD_DESCRIPTORS3D
>>> directive -
>>> g++ -o sdf_pmi_blob sdf_pmi.cpp -I/packages/rdkit/include/rdkit
>>> -L/packages/rdkit/lib -lDescriptors -lGraphMol -lFileParsers
>>> -Wno-deprecated -O2 -DRDK_BUILD_DESCRIPTORS3D
>>> This seems a bit odd...
>>> 3) Is it necessary to make separate calls to the individual PMI() and/or
>>> NPR() functions? Surely this results in duplication of some of the heavier
>>> calculations? I can't find any equivalent of calcPMIDescriptors() which
>>> returned a 'Moments' struct containing all the PMI and NPR values in one go.
>>> 4) The big one! The returned results look very odd. They appear to
>>> relate more to the dimensions of the molecule than the moments of inertia.
>>> For a rod-like molecule (dimethylacetylene) I'd expect two large and one
>>> small PMI (e.g. PMI1: 6.61651   PMI2: 150.434   PMI3: 150.434  NPR1:
>>> 0.043

Re: [Rdkit-discuss] PMI API

2017-01-15 Thread Chris Earnshaw
Thanks Greg

I've built a version of RDKit with fixes from https://github.com/
greglandrum/rdkit/tree/fix/github1262 and can confirm that it gives exactly
the same values of PMI and NPR that I got with the RDKit fork by 'hahnda6'.
I can't say for certain that the PMI values are correct in absolute terms,
but the NPR values are certainly what would be expected for those test
molecules.

I'm worried about the Todeschini paper - I think there are errors in some
of the equations and inconsistencies in the discussion, some of which may
involve mixing up PMIs with eigenvalues of the covariance matrix.
Unfortunately I don't have access to the original references so can't check
in detail, but I'd be disinclined to take any of the equations at face
value.

Chris

On 15 January 2017 at 10:50, Greg Landrum  wrote:

> I managed to make some time to look into this this weekend and I've found
> a bug and something I don't understand. Hopefully the community can help
> out here.
>
> On Sun, Jan 8, 2017 at 11:17 AM, Chris Earnshaw 
> wrote:
>
>> 4) The big one! The returned results look very odd. They appear to relate
>> more to the dimensions of the molecule than the moments of inertia. For a
>> rod-like molecule (dimethylacetylene) I'd expect two large and one small
>> PMI (e.g. PMI1: 6.61651   PMI2: 150.434   PMI3: 150.434  NPR1: 0.0439828
>> NPR2: 0.98) but actually get PMI1: 0.061647  PMI2: 0.061652  PMI3:
>> 25.3699  NPR1: 0.002430  NPR2: 0.002430.
>> For disk-like (benzene) the result should be one large and two medium
>> (e.g. PMI1: 89.1448  PMI2: 89.1495  PMI3: 178.294  NPR1: 0.499987  NPR2:
>> 0.500013) but get PMI1: 2.37457e-10  PMI2: 11.0844  PMI3: 11.0851  NPR1:
>> 2.14213e-11  NPR2: 0.33.
>> Finally for a roughly spherical molecule (neopentane) the NPR values look
>> reasonable (no great surprise) but the absolute PMI values may be too
>> small: old program - PMI1: 114.795  PMI2: 114.797  PMI3: 114.799
>> NPR1: 0.66  NPR2: 0.88, new program - PMI1: 6.59466  PMI2:
>> 6.59488  PMI3: 6.59531  NPR1: 0.02  NPR2: 0.35
>>
>
> Your expectations are correct: the current RDKit implementation is wrong.
> The corresponding github entry is here: https://github.com/
> rdkit/rdkit/issues/1262
> This is due to a mistake in the way the principal moments are calculated
> (which is due to the fact that I don't spend a lot of time working
> with/thinking about 3D descriptors). Instead of using the
> eigenvectors/eigenvalues of the inertia matrix (the tensor of inertia) the
> RDKit is currently using the covariance matrix. There's some more on the
> relationship between these two here: http://number-none.com/
> blow/inertia/deriving_i.html
>
> The problem is easy to fix (and I have something working here:
> https://github.com/greglandrum/rdkit/tree/fix/github1262), but it screws
> up the values of the descriptors that are derived from here:
> Todeschini and Consoni "Descriptors from Molecular Geometry" Handbook of
> Chemoinformatics http://dx.doi.org/10.1002/9783527618279.ch37
> These include the radius of gyration, inertial shape factor, etc.
> Within that article they state that Ic = 0 for planar molecules. Ignoring
> the inequality on page 1010, which says that Ic is the largest moment and
> is contradicted by the rest of the text (particularly the inequalities on
> page 1011), Ic corresponds to the smallest principal moment : PMI1.
>
> So now I'm confused, but I'm hoping this is obvious to someone versed in
> the field: I'd like to reproduce the descriptors described in the
> Todeschini article, but I clearly can't do that using the actual moments of
> inertia. I could keep using the eigenvalues of the covariance matrix there,
> but that doesn't match what's described in the text.
>
> Two things that would be extremely helpful:
> 1) an explanation of the disconnect here from someone who knows this
> stuff, I would guess that it's pretty simple
> 2) The results of running the files github1262_1.mol, github1262_2.mol,
> and github1262_3.mol from here: https://github.com/
> greglandrum/rdkit/tree/fix/github1262/Code/GraphMol/
> MolTransforms/test_data through Dragon and calculating the radius of
> gyration, inertial shape factor, eccentricity, molecular asphericity, and
> spherocity index.
>
> Best,
> -greg
>
>
>
>>
>
--
Developer Access Program for Intel Xeon Phi Processors
Access to Intel Xeon Phi processor-based developer platforms.
With one year of Intel Parallel Studio XE.
Training and support from Colfax.
Order your platform today. http://sdm.link/xeonphi___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] PMI API

2017-01-16 Thread Chris Earnshaw
On 16 January 2017 at 06:25, Guillaume GODIN 
wrote:

> reading carefully the Todeschini article, them said that Ic,Ib,Ia are
> determine as max & min values of I other all 3D axis passing throught the
> center of mass!
>
I don't quite understand this comment. The inequality Ia <= Ib <= Ic is one
of the errors in the Todeschini article pointed out by Greg yesterday. By
definition, the Principal Moment of Inertia axes pass through the centre of
mass.

The "global PM" is never zero (sum of mi*ri*ri) idem for PMi even for
> planar molecule.
>
The global Moment of Inertia is only zero for monatomics.


> But When you have a planar molecule, the matrix is no more 3D but 2D! so
> it's normal to consider that the 3nd PM is zero.
>
I really don't understand this - it's simply wrong. The molecule may be 2D
but the three principal moments of inertia are most definitely non-zero for
a planar structure. For a fully symmetrical molecule like benzene the
largest PMI is around the axis perpendicular to the plane of the molecule
and there are two equivalent, smaller, PMIs perpendicular to each other in
the plane of the molecule. For a less symmetrical molecule like
naphthalene, the largest PMI is again around the axis perpendicular to the
plane, the intermediate PMI is along the fusion bond between the rings and
the smallest PMI is around the long axis of the molecule. There's no way it
can be correct to consider the 3rd PMI as zero in any planar molecule -
it's never equal to zero and is only degenerate with the 2nd PMI for fully
symmetric molecules. Only in the special case of a completely linear
molecule (e.g. acetylene, HCN) is the 3rd PMI (along the axis of the
molecule) equal to zero.

Apologies - I appear to have opened a can of worms here...

Chris

> --
> *De :* Greg Landrum 
> *Envoyé :* dimanche 15 janvier 2017 17:42
> *À :* Guillaume GODIN; RDKit Discuss
>
> *Objet :* Re: [Rdkit-discuss] PMI API
>
> Thanks Guillaume!
>
> On Sun, Jan 15, 2017 at 5:01 PM, Guillaume GODIN <
> guillaume.go...@firmenich.com> wrote:
>
>> Here, Dragon results for the 3 molecules: I've included both  Whim and 3D
>> descriptors but I don't have access to PMi!
>>
>>
>> I found the second document in agreement with Peter answer...
>>
>>
>> BR,
>>
>> *Dr. Guillaume GODIN*
>> Principal Scientist
>> Chemoinformatic & Datamining
>> Innovation
>> CORPORATE R&D DIVISION
>> DIRECT LINE +41 (0)22 780 3645 <+41%2022%20780%2036%2045>
>> MOBILE  +41 (0)79 536 1039 <+41%2079%20536%2010%2039>
>> Firmenich SA
>> RUE DES JEUNES 1 | CASE POSTALE 239 | CH-1211 GENEVE 8
>>
>> --
>> *De :* Peter Gedeck 
>> *Envoyé :* dimanche 15 janvier 2017 15:07
>> *À :* Greg Landrum; RDKit Discuss; Guillaume GODIN
>>
>> *Objet :* Re: [Rdkit-discuss] PMI API
>>
>> According to this:
>> https://en.wikipedia.org/wiki/List_of_moments_of_inertia
>> The moments of inertia of a disk (something like benzene) are:
>>
>> Iz = mr^2/2
>> Ix = Iy = mr^2/4
>>
>> None of them is zero. The smallest moment of inertia of a rod-like
>> molecule (e.g. C#C) is zero.
>>
>> Best,
>>
>> Peter
>>
>>
>>
>> On Sun, Jan 15, 2017 at 8:15 AM Greg Landrum 
>> wrote:
>>
>>> Hi Guillaume,
>>>
>>> I think it this case it's something else. According to the Todeschini
>>> article the smallest moment of inertia of a planar molecule like benzene
>>> should be zero. The eigenvalues of the inertia matrix for benzene, however,
>>> are definitely not zero (and not close enough that it's likely to be
>>> round-off error).
>>> It would be very nice if you could run the three files I mention through
>>> Dragon and let me know what it calculates for those descriptors.
>>>
>>> -greg
>>>
>>>
>>> _
>>> From: Guillaume GODIN 
>>> Sent: Sunday, January 15, 2017 1:11 PM
>>> Subject: RE: [Rdkit-discuss] PMI API
>>> To: Greg Landrum , RDKit Discuss <
>>> rdkit-discuss@lists.sourceforge.net>, Chris Earnshaw <
>>> cgearns...@gmail.com>
>>>
>>>
>>>
>>> Dear Greg,
>>>
>>>
>>> I  suspect that it's a precision error or eigen algorithm shift between
>>> rdkit c++ & dragon.
>>>
>>>
>>> To obtain good value, I suggest to try to implement a test on the eigen
>>> values like i did in g

Re: [Rdkit-discuss] PMI API

2017-01-16 Thread Chris Earnshaw
Dear Guillaume

Thanks - looks like we agree about reality (good!) and that Todeschini et
al. are wrong in their discussion about planar molecules. Whether this is a
simple mistaken assertion, or if they've mixed up another quantity (e.g.
the eigenvalues of the covariance matrix) with the PMIs is impossible to
say.

Either way, it makes it rather hard to trust their derivations generally -
especially as there appear to be other errors (e.g. the denominator in eq.
16 should be the square root of the given sum of squares, according to
their reference).

Best regards,
Chris

Dr Chris Earnshaw
CGE Computational Chemistry
Phone: +44(0) 1223 426000
Mobile: 07944 707773
E-mail: ch...@cge-compchem.co.uk

On 16 January 2017 at 08:54, Guillaume GODIN 
wrote:

> Dear Chris,
>
>
> No prob let me explain:
>
>
> I Aggree on monoatomics center of mass is the atom so  (for all x axis:
> Ix= 0)
>
> ​
>
> Now I consider the mathematics only not the physics.
>
>
> I suggest that they (Todeschini) are not really computing the "real
> physical" PMi on the 3 axis but arbitrary said that for 2D molecules the
> 3nd axis PMi is zero.
>
>
> BR
>
>
>
> *Dr. Guillaume GODIN*
> Principal Scientist
> Chemoinformatic & Datamining
> Innovation
> CORPORATE R&D DIVISION
> DIRECT LINE +41 (0)22 780 3645 <+41%2022%20780%2036%2045>
> MOBILE  +41 (0)79 536 1039 <+41%2079%20536%2010%2039>
>     Firmenich SA
> RUE DES JEUNES 1 | CASE POSTALE 239 | CH-1211 GENEVE 8
>
> --
> *De :* Chris Earnshaw 
> *Envoyé :* lundi 16 janvier 2017 09:36
> *À :* Guillaume GODIN
> *Cc :* Greg Landrum; RDKit Discuss
>
> *Objet :* Re: [Rdkit-discuss] PMI API
>
>
>
> On 16 January 2017 at 06:25, Guillaume GODIN <
> guillaume.go...@firmenich.com> wrote:
>
>> reading carefully the Todeschini article, them said that Ic,Ib,Ia are
>> determine as max & min values of I other all 3D axis passing throught the
>> center of mass!
>>
> I don't quite understand this comment. The inequality Ia <= Ib <= Ic is
> one of the errors in the Todeschini article pointed out by Greg yesterday.
> By definition, the Principal Moment of Inertia axes pass through the centre
> of mass.
>
> The "global PM" is never zero (sum of mi*ri*ri) idem for PMi even for
>> planar molecule.
>>
> The global Moment of Inertia is only zero for monatomics.
>
>
>> But When you have a planar molecule, the matrix is no more 3D but 2D! so
>> it's normal to consider that the 3nd PM is zero.
>>
> I really don't understand this - it's simply wrong. The molecule may be 2D
> but the three principal moments of inertia are most definitely non-zero for
> a planar structure. For a fully symmetrical molecule like benzene the
> largest PMI is around the axis perpendicular to the plane of the molecule
> and there are two equivalent, smaller, PMIs perpendicular to each other in
> the plane of the molecule. For a less symmetrical molecule like
> naphthalene, the largest PMI is again around the axis perpendicular to the
> plane, the intermediate PMI is along the fusion bond between the rings and
> the smallest PMI is around the long axis of the molecule. There's no way it
> can be correct to consider the 3rd PMI as zero in any planar molecule -
> it's never equal to zero and is only degenerate with the 2nd PMI for fully
> symmetric molecules. Only in the special case of a completely linear
> molecule (e.g. acetylene, HCN) is the 3rd PMI (along the axis of the
> molecule) equal to zero.
>
> Apologies - I appear to have opened a can of worms here...
>
> Chris
>
>> --
>> *De :* Greg Landrum 
>> *Envoyé :* dimanche 15 janvier 2017 17:42
>> *À :* Guillaume GODIN; RDKit Discuss
>>
>> *Objet :* Re: [Rdkit-discuss] PMI API
>>
>> Thanks Guillaume!
>>
>> On Sun, Jan 15, 2017 at 5:01 PM, Guillaume GODIN <
>> guillaume.go...@firmenich.com> wrote:
>>
>>> Here, Dragon results for the 3 molecules: I've included both  Whim and
>>> 3D descriptors but I don't have access to PMi!
>>>
>>>
>>> I found the second document in agreement with Peter answer...
>>>
>>>
>>> BR,
>>>
>>> *Dr. Guillaume GODIN*
>>> Principal Scientist
>>> Chemoinformatic & Datamining
>>> Innovation
>>> CORPORATE R&D DIVISION
>>> DIRECT LINE +41 (0)22 780 3645 <+41%2022%20780%2036%2045>
>>> MOBILE  +41 (0)79 536 1039 <+41%2079%20536%2010%2039>
>>> Firmenich SA
>>>

Re: [Rdkit-discuss] PMI API

2017-01-17 Thread Chris Earnshaw
The new version looks good to me as far as I can test it. PMI and NPR are
still fine, the radius of gyration is right (for an extremely artificial
test system) and the asphericity index also seems right (despite my best
efforts to confuse things further - sorry about that!). Also highlights
even more confusion in the Todeschini article - the approximate asphericity
values for prolate and oblate molecules are reversed.

The only (very trivial) thing I've spotted is the comment in the
inertialShapeFactor function. 'planar or no coordinates' should be 'linear
or no coordinates' to avoid confusion.

Chris

On 16 January 2017 at 09:30, Greg Landrum  wrote:

>
>
> On Mon, Jan 16, 2017 at 10:22 AM, Chris Earnshaw  > wrote:
>
>>
>> Either way, it makes it rather hard to trust their derivations generally
>> - especially as there appear to be other errors (e.g. the denominator in
>> eq. 16 should be the square root of the given sum of squares, according to
>> their reference).
>>
>
> Indeed. Given the problems encountered, I went back and checked some
> additional references to find definitions of the descriptors. The results
> are in this PR, which I'd love feedback on if you have time to take a look:
> https://github.com/rdkit/rdkit/pull/1265
>
> I didn't manage to find any information about "inertial shape factor" and
> don't have access to the references cited in the Todeschini paper, but I
> think the others are now reasonably reliable.
>
> -greg
>
>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] PMI API

2017-01-17 Thread Chris Earnshaw
The dimensions along one of the axes of a planar molecule in its inertial
frame will be zero, but the principal moments of inertia will all be
non-zero. The moment of inertia about an axis can only be zero if all the
atoms in the molecule are precisely aligned on that axis. That's only
possible for linear molecules. There's no way to draw a straight line axis
through all the atoms in a non-linear molecule, which would be a
requirement for the corresponding moment of inertia to be zero.

Chris

On 17 January 2017 at 12:29, Brian Kelley  wrote:

> Looks like I'm late to the game.  I don't know about the PMI descriptors
> per-se, but if a planar molecule is in it's inertial frame, one of the axes
> should be zero (whether it is x, y or z) which means that the one of the
> M1x, M1y or M1z should be zero.
>
> We had some good experimentation with multipole expansion of moments
> (essentially based on the description of electrostatic multipoles) that
> might be nice to add to the PMI framework.
>
> Greg, I'm assuming that the Moments.py we opensourced a while back is
> similarly broken?  I'm attaching it here for posterity but it does appear
> to match the moe PMI's.
>
>
>
> On Tue, Jan 17, 2017 at 4:55 AM, Chris Earnshaw 
> wrote:
>
>> The new version looks good to me as far as I can test it. PMI and NPR are
>> still fine, the radius of gyration is right (for an extremely artificial
>> test system) and the asphericity index also seems right (despite my best
>> efforts to confuse things further - sorry about that!). Also highlights
>> even more confusion in the Todeschini article - the approximate asphericity
>> values for prolate and oblate molecules are reversed.
>>
>> The only (very trivial) thing I've spotted is the comment in the
>> inertialShapeFactor function. 'planar or no coordinates' should be 'linear
>> or no coordinates' to avoid confusion.
>>
>> Chris
>>
>> On 16 January 2017 at 09:30, Greg Landrum  wrote:
>>
>>>
>>>
>>> On Mon, Jan 16, 2017 at 10:22 AM, Chris Earnshaw <
>>> ch...@cge-compchem.co.uk> wrote:
>>>
>>>>
>>>> Either way, it makes it rather hard to trust their derivations
>>>> generally - especially as there appear to be other errors (e.g. the
>>>> denominator in eq. 16 should be the square root of the given sum of
>>>> squares, according to their reference).
>>>>
>>>
>>> Indeed. Given the problems encountered, I went back and checked some
>>> additional references to find definitions of the descriptors. The results
>>> are in this PR, which I'd love feedback on if you have time to take a look:
>>> https://github.com/rdkit/rdkit/pull/1265
>>>
>>> I didn't manage to find any information about "inertial shape factor"
>>> and don't have access to the references cited in the Todeschini paper, but
>>> I think the others are now reasonably reliable.
>>>
>>> -greg
>>>
>>>
>>>
>>
>> 
>> --
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] mass replacement of External R-groups with many substituents

2017-03-16 Thread Chris Earnshaw
Hi Brian

I'm by no means an expert in RDKit with Python, but until someone else
comes along, here are a few thoughts.

Your reaction SMARTS specifically defines aromatic carbons joined by single
bonds which won't match an incoming benzene ring, and it's a bit redundant
to specify that aromatic carbons are in rings. It's possibly also neater to
specify the ring closure if you want to include the whole ring. Finally, I
don't think that you can specify dummy atoms in a reaction SMARTS as you're
attempting (it certainly doesn't work for me with R and R1) but you
certainly can specify 'placeholder' atoms using real elements. I've found
U, V & W handy for this in the past.

A small Python script with a modified SMARTS which works for me is -
from rdkit import Chem
from rdkit.Chem import AllChem
rxn =
AllChem.ReactionFromSmarts('[c:1]1[c:2][c:3][c:4][c:5]([U:7])[c:6]1.[V:8]-[*:9]
>> [c:1]1[c:2][c:3][c:4][c:5](-[*:9])[c:6]1')
ps =
rxn.RunReactants((Chem.MolFromSmiles('c1c1[U]'),Chem.MolFromSmiles('[V]Cl')))
Chem.MolToSmiles(ps[0][0])

This returns 'Clc1c1'.

Unless you have some specific reason to include the complete benzene ring,
the reaction SMARTS can be simplified a lot further, to -
[c:1][U:2].[V:3]-[*:4] >> [c:1]-[*:4]
This will carry out your virtual reaction on any aromatic carbon carrying
the appropriate placeholder element.

I'd be interested to learn if there is an approved way to mark substitution
positions in reactions like this without having to resort to the cheat of
using obscure elements.

Regards,
Chris

On 15 March 2017 at 23:20, Bennion, Brian  wrote:

> Hello All,
>
>
>
> I have looked around the email list and studied the daylight guide as well
> as the opensmiles website in hopes of solving my problem.  External
> r-groups is a proposed extension but that is all I could find.
>
>
>
> It is possible that I have made it too complicated though.
>
>
>
> In discussions with my synthetic chemist we came up with 27 substituents
> two place around our molecule scaffold.
>
> I labeled the scaffolds with R1 in the position that the substitution will
> occur and attached a dummy label to the substituents.  I thought that I
> could do a simple replacement rxn.  However, I have not been successful.
>
>
>
> The smarts string so far is listed below.
>
>
>
> AllChem.ReactionFromSmarts('[c;R:1]-[c;R:2]-[c;R:3]-[c;R:4]
> -[c;R:5]([R1;!R:7])-[c;R:6].[R:8]-[*:9] >> [c;R:1]-[c;R:2]-[c;R:3]-[c;R:
> 4]-[c;R:5]([*:9])-[c;R:6]')
>
>
>
> Basically just adding a group to positions along a benzene ring
>
>
>
> Thanks
>
> Brian Bennion
>
>
>
>
>
>
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] SMARTS substructure queries with SQL conjunctions

2017-03-21 Thread Chris Earnshaw
Hi Akos

Very strange behaviour. I don't see anything wrong with your SQL syntax.
I've tried equivalent searches in my 2.6M compound database and they give
the expected results. I used iodine rather than gold, for which there are
19504 structures. Adding the qualifying SQL clauses singly and in
combination gives -
SELECT structure FROM scr_structure WHERE structure @> 'I'::qmol AND NOT
structure @> '[#6]~[#6]'::qmol;
 Ic1nc[nH]n1
 Ic1nnc[nH]1
 O=P(O)(CI)CI
 ClI
 BrI
 [O-][I+3]([O-])([O-])O
 I[Cd]I
 [K]I
 [O-][I+2]([O-])O.[O-][I+2]([O-])O
 I[Hg]I
 [O-][I+2]([O-])O[I+2]([O-])[O-]
 [O-][I+2]([O-])O
 [Na]I
 ICI
 Cn1nnnc1I
(15 rows)

SELECT structure FROM scr_structure WHERE structure @> 'I'::qmol AND NOT
structure @> '[#6!H0]'::qmol;
 Nc1c(I)c(C(=O)O)c(I)c(C(=O)O)c1I
 O=C(O)c1c(I)c(I)c(I)c(I)c1C(=O)O
 Oc1nnc(I)c(O)n1
 O=c1[nH]c(Cl)c(I)c(=O)[nH]1
 O=c1[nH]nc(I)c(=O)[nH]1
 Fc1c(F)c(F)c(SC(I)=C(Sc2c(F)c(F)c(F)c(F)c2F)Sc2c(F)c(F)c(F)c(F)c2F)c(F)c1F
 ClI
 BrI
 [O-][I+3]([O-])([O-])O
 I[Cd]I
 [K]I
 [O-][I+2]([O-])O.[O-][I+2]([O-])O
 I[Hg]I
 [O-][I+2]([O-])O[I+2]([O-])[O-]
 [O-][I+2]([O-])O
 [Na]I
 Nc1nc(N)c(I)c(O)n11
 NC(=O)c1nn[nH]c1I
(18 rows)

SELECT structure FROM scr_structure WHERE structure @> 'I'::qmol AND NOT
structure @> '[#6]~[#6]'::qmol AND NOT structure @> '[#6!H0]'::qmol;
 ClI
 BrI
 [O-][I+3]([O-])([O-])O
 I[Cd]I
 [K]I
 [O-][I+2]([O-])O.[O-][I+2]([O-])O
 I[Hg]I
 [O-][I+2]([O-])O[I+2]([O-])[O-]
 [O-][I+2]([O-])O
 [Na]I
(10 rows)

This is with RDKit 2016-09 and postgresql-9.5. The returned hits are all
consistent with the additional qualifiers. BTW, the presence or absence of
parentheses around the clauses makes no difference, nor does replacing
[C,c] by [#6].

If the queries are running correctly (and not knowing the structure of your
database), is there any possibility of ambiguity in the cid values in the
cpds table? Do they have a primary key / uniqueness constraint? If any cids
are associated with multiple structures then the structure searching could
be working fine, but you might sometimes get spurious structures returned
from the cid lookup. You can check this with the query -
SELECT cid, COUNT(cid) FROM cpds GROUP BY cid HAVING COUNT(cid) > 1;
This will return any cid values which occur more than once in the cpds
table.

Failing that, the only time I've seen behaviour resembling your report was
with a database (using a different cartridge, in Oracle) where the
structure index had become corrupted. I don't know if that's even a
possibility with RDKit, but it might be worth considering an index rebuild
if everything else looks OK.

Best regards,
Chris


On 21 March 2017 at 06:34, Akos Kokai  wrote:

> Dear RDKit community,
>
> I'm getting unexpected results when combining SMARTS substructure
> comparisons in SQL statements, and I'd like to ask for feedback to help me
> understand what's going on.
>
> Given an element, say Au, when I make a query like this:
>
> SELECT cpds.cid FROM cpds WHERE (cpds.molecule @> '[Au]' ::qmol) AND NOT
> (cpds.molecule @> '[C,c]~[C,c]' ::qmol) AND NOT (cpds.molecule @>
> '[C!H0,c!H0]' ::qmol)
>
> I don't expect to see any compounds with C-C or C-H bonds in the results.
> Yet I get results like [(P(C5F5)3)4Au]Cl [1], or for example with Se,
> [(CH3)3Se]+ [2]. Why?
>
> It seems that usually my 'unexpected' results are matching one of the two
> "AND NOT" conditions, not both (see console output below) but I haven't
> checked systematically. I want the query to return only molecules for which
> the last two substructure conditions are both false. Is my understanding of
> SQL conjunctions mistaken?
>
> I'm using RDKit 2016-03 and the rdkit extension on PostgreSQL 9.4. I'm
> probably not using RDKit for what it was intended, but I'm certainly
> grateful that it exists and is free software. I'd very much appreciate any
> feedback on this question.
>
> Best regards,
> Akos
>
> --
>
> [1]: https://pubchem.ncbi.nlm.nih.gov/compound/11520592
> [2]: https://pubchem.ncbi.nlm.nih.gov/compound/91580
>
> Some console output regarding those compounds:
>
> In [3]: mSe = Chem.MolFromSmiles('C[Se+](C)C')
>
> In [4]: mAu = Chem.MolFromSmiles('C1(=C(C(=C(C(=C1F)F)P(C2=C(C(=C(C(=C2F)
> F)F)F)F)C3=C(C(=C(C(=C3F)F)F)F)F
>...: )F)F)F.Cl[Au]')
>
> In [5]: mSe.HasSubstructMatch(Chem.MolFromSmarts('[C,c]~[C,c]'))
> Out[5]: False
>
> In [6]: mAu.HasSubstructMatch(Chem.MolFromSmarts('[C,c]~[C,c]'))
> Out[6]: True
>
> In [7]: mSe.HasSubstructMatch(Chem.MolFromSmarts('[C!H0,c!H0]'))
> Out[7]: True
>
> In [8]: mAu.HasSubstructMatch(Chem.MolFromSmarts('[C!H0,c!H0]'))
> Out[8]: False
>
>
> Akos Kokai 
> PhD candidate, Department of Environmental Science, Policy & Management
> 
> Fellow, Berkeley Center for Green Chemistry 
> University of California, Berkeley
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engagin

Re: [Rdkit-discuss] HasSubstructMatch doesn't work as expected

2017-09-13 Thread Chris Earnshaw
Hi

The problem is due to RDkit perceiving the embedded pyranone in
CHEMBL1999443 as an aromatic system, which is probably correct. However, in
the structure of aspirin the carboxyl carbon and singly bonded oxygen are
non-aromatic, so if you just use the SMILES of aspirin as a query it won't
match CHEMBL1999443

You'll need to use a slightly more generic aspirin-like query to allow the
possibility of matching both 'normal' aspirin and embedded aromatic
analogues. CC(=O)Oc1c1[#6](=O)[#8] should work OK.

Regards,
Chris

On 13 September 2017 at 13:40, Michał Nowotka  wrote:

> Hi,
>
> This problem is probably due to my lack of chemistry knowledge but
> plese have a look:
>
> If I do a substructure search in ChEMBL using aspirin (CHEMBL25) as a
> query (ChEMBL API uses the Symix catridge):
>
> from chembl_webresource_client.new_client import new_client
> res = new_client.substructure.filter(chembl_id='CHEMBL25')
>
> One of them will be CHEMBL1999443:
>
> 'CHEMBL1999443' in (r['molecule_chembl_id'] for r in res)
> >>> True
>
> Now I take the molfile:
>
> new_client.molecule.set_format('mol')
> mol = new_client.molecule.get('CHEMBL1999443')
>
> and load it with aspirin into rdkit:
>
> from rdkit import Chem
> m = Chem.MolFromMolBlock(mol)
> pattern = Chem.MolFromMolBlock(new_client.molecule.get('CHEMBL25'))
>
> If I check if it has an aspirin as a substructure using rdkit, I'm
> getting false...
>
> m.HasSubstructMatch(pattern)
> >>> False
>
> Looking at this blog post:
> https://github.com/rdkit/rdkit-tutorials/blob/master/notebooks/002_SMARTS_
> SubstructureMatching.ipynb
> I tried to initialize rings and retry:
>
>  Chem.GetSymmSSSR(m)
>  m.HasSubstructMatch(pattern)
>  >>>False
>
> Chem.GetSymmSSSR(pattern)
> m.HasSubstructMatch(pattern)
> >>>False
>
> But as you can see without any luck. Is there anything else I can do
> to get the match anyway?
> Without having a match I can't aligh and higlight asprin substructure
> in CHEMBL1999443 image using GenerateDepictionMatching2DStructure and
> DrawMolecule functions.
>
> Kind regards,
>
> Michał Nowotka
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] single SMARTS for two patterns with Boolean OR

2017-09-19 Thread Chris Earnshaw
Hi

Will the recursive SMARTS [$(C-C),$(N-N)] not do the job?

I'd parse this in English as 'an atom which is EITHER an aliphatic carbon
singly bonded to an aliphatic carbon OR an aliphatic nitrogen singly bonded
to an aliphatic nitrogen'.

Regards,
Chris Earnshaw

On 19 September 2017 at 15:01, James T. Metz via Rdkit-discuss <
rdkit-discuss@lists.sourceforge.net> wrote:

> Dante,
>
> Yes.  In principle, if one can figure out all of the possible
> undesired cross
> matches.
>
> Since my goal is to do this in RDkit and generate groups of atoms
> that match, perhaps one approach is to simply use multiple RDkit pattern
> matching statements (with multiple SMARTS), generate the groups of atoms,
> then combine the lists, removing identical groups.
>
> Hmmm... Is there a more straightforward (elegant) solution?
>
> Regards,
> Jim Metz
>
>
>
> -Original Message-
> From: Dante 
> To: James T. Metz 
> Cc: RDKit Discuss 
> Sent: Tue, Sep 19, 2017 8:45 am
> Subject: Re: [Rdkit-discuss] single SMARTS for two patterns with Boolean OR
>
> Hi Jim,
>
> Could you use the 'NOT' logical operator (!) in combination with recursive
> SMARTS to eliminate the cross-matches?
>
> Cheers,
>
> Dante
>
> On Tue, Sep 19, 2017 at 9:13 AM, James T. Metz via Rdkit-discuss <
> rdkit-discuss@lists.sourceforge.net> wrote:
>
> Hello,
>
> Is it possible to write a single SMARTS for two separate patterns involving
> a Boolean OR?
>
> For example,  I want to write a single SMARTS that can match the
> patterns of
>
> [C]-[C]
>
> or
>
> [N]-[N]
>
> I realize that I could write something like
>
> [C,N]-[C,N]
>
> but that would also match "cross" patterns such as
>
> CN and NC which I don't want.
>
> I have tried to write
>
> ([C]-[C]), ([N\-[N])   but I have not been able to get that syntax or
> related
> expressions (variations of parentheses, brackets, etc) to work.
>
> Hence, if someone knows how to combine separate SMARTS expressions into
> a single expression with a Boolean OR, I would be grateful.  Thank you.
>
> Regards,
> Jim Metz
>
>
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Returning Z-matrix coordinates for a molecule in rdkit?

2017-09-19 Thread Chris Earnshaw
Hi

Open Babel will convert a wide range of structure formats and can produce
at least a couple of different flavours of Z-matrix, including MOPAC and
Gaussian. I'm not aware of any way to get a Z-matrix directly from RDKit
(but would be happy to find out I'm wrong).

Regards,
Chris Earnshaw

On 19 September 2017 at 16:15, Janusz Petkowski  wrote:

> Dear RDKit Community,
>
> I have a quick question. Is it possible to return a Z-matrix instead of
> the usual, Cartesian coordinates for a molecule in RDKit or do you know of
> any way of converting or generating Z-matrix coordinates for a batch of
> molecules?
>
> Thanks!
>
> Dr Janusz Petkowski
>
> Research Fellow at MIT EAPS <https://eapsweb.mit.edu/people/jjpetkow>
>
> Tel:  +1 (617) 258 - 6910 <%28857%29%20777-6977>
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] single SMARTS for two patterns with Boolean OR

2017-09-19 Thread Chris Earnshaw
Hmm, that makes it distinctly trickier. Recursive SMARTS work at the level
of single atoms and their environment, rather than treating a group of
atoms together, so I suspect it isn't possible to create a single SMARTS to
give the match information you need in one go. I'd be very pleased to find
out I'm wrong!

If you could use a small script it's pretty trivial of course -
m = Chem.MolFromSmiles("CCNN")
p1 = Chem.MolFromSmarts("C-C")
p2 = Chem.MolFromSmarts("N-N")
m1 = m.GetSubstructMatches(p1)
m2 = m.GetSubstructMatches(p2)
m1 + m2

which gives ((0, 1), (2, 3)) as required, but if you have a specific need
for the 'single SMARTS' approach that's not much use. Sorry not to be more
helpful...

Chris Earnshaw


On 19 September 2017 at 16:50, James T. Metz  wrote:

> Chris,
>
> Thank you for your interesting suggestion, but it is not quite what I
> need.
>
> For example, consider the molecule
>
> m = Chem.MolFromSmiles("CCNN")
>
> I am looking for one SMARTS that using the SMARTS pattern matching
> capability in RDkit would return 2 groups, each group containing the two
> atoms corresponding to  CC  and  NN.
>
> Your suggested recursive SMARTS  and code below
>
> pattern = Chem.MolFromSmarts('[$(C-C),$(N-N)]')
> match = m.GetSubstructMatches(pattern)
> match
>
> returns
>
> ((0,), (1,), (2,), (3,))
>
>
> The output I am trying to achieve, instead, is
>
>
> ((0,1), (2,3))
>
>
> Is there a single SMARTS that will do that?
>
>
> Regards,
>
> Jim Metz
>
>
>
>
>
>
> -Original Message-
> From: Chris Earnshaw 
> To: James T. Metz 
> Cc: Rdkit-discuss@lists.sourceforge.net  sourceforge.net>
> Sent: Tue, Sep 19, 2017 10:13 am
> Subject: Re: [Rdkit-discuss] single SMARTS for two patterns with Boolean OR
>
> Hi
>
> Will the recursive SMARTS [$(C-C),$(N-N)] not do the job?
>
> I'd parse this in English as 'an atom which is EITHER an aliphatic carbon
> singly bonded to an aliphatic carbon OR an aliphatic nitrogen singly bonded
> to an aliphatic nitrogen'.
>
> Regards,
> Chris Earnshaw
>
> On 19 September 2017 at 15:01, James T. Metz via Rdkit-discuss <
> rdkit-discuss@lists.sourceforge.net> wrote:
>
> Dante,
>
> Yes.  In principle, if one can figure out all of the possible
> undesired cross
> matches.
>
> Since my goal is to do this in RDkit and generate groups of atoms
> that match, perhaps one approach is to simply use multiple RDkit pattern
> matching statements (with multiple SMARTS), generate the groups of atoms,
> then combine the lists, removing identical groups.
>
> Hmmm... Is there a more straightforward (elegant) solution?
>
> Regards,
> Jim Metz
>
>
>
> -Original Message-
> From: Dante 
> To: James T. Metz 
> Cc: RDKit Discuss 
> Sent: Tue, Sep 19, 2017 8:45 am
> Subject: Re: [Rdkit-discuss] single SMARTS for two patterns with Boolean OR
>
> Hi Jim,
>
> Could you use the 'NOT' logical operator (!) in combination with recursive
> SMARTS to eliminate the cross-matches?
>
> Cheers,
>
> Dante
>
> On Tue, Sep 19, 2017 at 9:13 AM, James T. Metz via Rdkit-discuss <
> rdkit-discuss@lists.sourceforge.net> wrote:
>
> Hello,
>
> Is it possible to write a single SMARTS for two separate patterns involving
> a Boolean OR?
>
> For example,  I want to write a single SMARTS that can match the
> patterns of
>
> [C]-[C]
>
> or
>
> [N]-[N]
>
> I realize that I could write something like
>
> [C,N]-[C,N]
>
> but that would also match "cross" patterns such as
>
> CN and NC which I don't want.
>
> I have tried to write
>
> ([C]-[C]), ([N\-[N])   but I have not been able to get that syntax or
> related
> expressions (variations of parentheses, brackets, etc) to work.
>
> Hence, if someone knows how to combine separate SMARTS expressions into
> a single expression with a Boolean OR, I would be grateful.  Thank you.
>
> Regards,
> Jim Metz
>
>
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
>
> 
> --
> Check out the vibrant tech community on one of the world's

Re: [Rdkit-discuss] need SMARTS query with a specific exclusion

2017-09-24 Thread Chris Earnshaw
Hi Jim

It can be done with recursive SMARTS, though the syntax is a bit
painful This may do what you want -
[$(a);!$(n1(C)ccc(=O)nc1=O);!$(c1cc(=O)nc(=O)n1C);!$(c1c(=O)nc(=O)n(C)c1);!$(c(=O)1nc(=O)n(C)cc1);!$(n1c(=O)n(C)ccc1=O);!$(c(=O)1n(C)ccc(=O)n1)]:1:a:a:a:a:a:1

Its basically the general 6-ring aromatic pattern a:1:a:a:a:a:a:1,
with recursive SMARTS applied to the first atom to ensure that this
can't match any of the 6 ring atoms in your undesired system.

Regards,
Chris Earnshaw

On 24 September 2017 at 05:04, James T. Metz via Rdkit-discuss
 wrote:
> Hello,
>
> Suppose I have the following molecule
>
> m = 'CN1C=CC(=O)NC1=O'
>
> I would like to be able to use a SMARTS pattern
>
> pattern = '[a]1:[a][a]:[a]:[a]:a]1'
>
> to recognize the 6 atoms in a typical aromatic ring, but
> I do not want to recognize the 6 atoms in the molecule,
> m, as aromatic.  In other words, I am trying to write
> a specific exclusion.
>
> Is it possible to modify the SMARTS pattern to
> exclude the above molecule?  I have tried using
> recursive SMARTS, but I can't get the syntax to
> work.
>
> Any ideas?  Thank you.
>
> Regards,
> Jim Metz
>
>
>
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] need SMARTS query with a specific exclusion

2017-09-24 Thread Chris Earnshaw
Hi Jim

The key thing to remember about the recursive SMARTS clauses is that
they only match one atom (the first), and the rest of the string
describes the environment in which that atom is located. So the clause
$(n1(C)ccc(=O)nc1=O) matches just the nitrogen atom - which has
embedded in the rest of the ring system. We then negate that with the
! symbol.

If we use just the recursive SMARTS expression '[$(a)]' (or the simple
SMARTS 'a'), it can match any of the six aromatic atoms in the
heterocycle. Adding the first exclusion '[$(a);!$(n1(C)ccc(=O)nc1=O)]'
means this atom can't match the nitrogen substituted by aliphatic
C,but it can still match any of the other five aromatic atoms.
Consequently there are five more exclusion clauses to add, each of
which starts with a different one of the aromatic atoms in your
undesired structure. As long as one of the atoms in the full SMARTS is
prevented from matching any of the atoms in the undesired structure in
this way, then the overall match is prevented.

Adding an exclusion for pyridine is then easy. We're already excluding
six patterns, and (considering symmetry) we only need to add four more
to exclude all pyridines. Appending
';!$(n1c1);!$(c1n1);!$(c1cnccc1);!$(c1ccncc1)' inside the
square brackets should do the trick.

You're quite right though, this gets pretty cumbersome very quickly
and it may well be best to handle it in code with simple include /
exclude SMARTS patterns. You'll have to think about checking which
atoms have been matched - for example, do you want to match quinoline
because it contains a benzene ring, or exclude it because it contains
a pyridine? If the former you'll have to check that the atoms matched
by your two patterns are different.

Hope this helps!

Chris Earnshaw

On 24 September 2017 at 15:01, James T. Metz  wrote:
> Chris,
>
> Wow! Your recursive SMARTS expression works as needed!
>
> Hmmm... Help me understand this better ... it looks like you "walk around"
> the
> ring of the substructure we want to exclude and employ a slightly different
> recursive SMARTS beginning at that atom.  Is that correct?
>
> Also, since my situation is likely to get more complicated with respect to
> exclusions, suppose I still wanted to utilize the general aromatic
> expression
> for a 6-membered ring i.e. [a]1:[a]:[a]:[a][a]:[a]1, and I wanted to exclude
> the structures we have been discussing, and I also wanted to exclude
> pyridine i.e., [n]1:[c]:[c]:[c]:[c]:[c]1.
>
> Is there a SMARTS expression that would capture 2 exclusions?
>
> Perhaps this is getting too clumsy!  It might be better to have one or more
> inclusion SMARTS and one or more exclusion SMARTS, and write code
> to remove those groups of atoms that are coming from the exclusion SMARTS.
>
> Any ideas for PYTHON/RDkit code?  Something like
>
> test_smiles = 'c1c1'
> inclusion_pattern = '[a]1:[a]:[a]:[a]:[a]:[a]1'
> exclusion_pattern = '[n]1:[c]:[c]:[c]:[c]:[c]1'
> etc...
>
> Hmmm... any other ideas, suggestions, comments?
>
> Thanks again.
>
> Regards,
> Jim Metz
>
>
>
>
> -Original Message-
> From: Chris Earnshaw 
> To: James T. Metz 
> Cc: Rdkit-discuss@lists.sourceforge.net
> 
> Sent: Sun, Sep 24, 2017 4:01 am
> Subject: Re: [Rdkit-discuss] need SMARTS query with a specific exclusion
>
> Hi Jim
>
> It can be done with recursive SMARTS, though the syntax is a bit
> painful This may do what you want -
> [$(a);!$(n1(C)ccc(=O)nc1=O);!$(c1cc(=O)nc(=O)n1C);!$(c1c(=O)nc(=O)n(C)c1);!$(c(=O)1nc(=O)n(C)cc1);!$(n1c(=O)n(C)ccc1=O);!$(c(=O)1n(C)ccc(=O)n1)]:1:a:a:a:a:a:1
>
> Its basically the general 6-ring aromatic pattern a:1:a:a:a:a:a:1,
> with recursive SMARTS applied to the first atom to ensure that this
> can't match any of the 6 ring atoms in your undesired system.
>
> Regards,
> Chris Earnshaw
>
> On 24 September 2017 at 05:04, James T. Metz via Rdkit-discuss
>  wrote:
>> Hello,
>>
>> Suppose I have the following molecule
>>
>> m = 'CN1C=CC(=O)NC1=O'
>>
>> I would like to be able to use a SMARTS pattern
>>
>> pattern = '[a]1:[a][a]:[a]:[a]:a]1'
>>
>> to recognize the 6 atoms in a typical aromatic ring, but
>> I do not want to recognize the 6 atoms in the molecule,
>> m, as aromatic. In other words, I am trying to write
>> a specific exclusion.
>>
>> Is it possible to modify the SMARTS pattern to
>> exclude the above molecule? I have tried using
>> recursive SMARTS, but I can't get the syntax to
>> work.
>>
>> Any ideas? Thank you.
>>
>> Regards,
>> Jim Metz
>>
>>
>>
>>
>> -

Re: [Rdkit-discuss] need SMARTS query with a specific exclusion

2017-09-24 Thread Chris Earnshaw
Hi

It amounts to the same thing - either do all tests on one atom, or one test
on all atoms.

The syntax is shorter for the latter if you can use the vector bindings but
may not be otherwise, especially if multiple exclusions are needed.

Regards,
Chris Earnshaw



On 24 Sep 2017 16:54, "David Cosgrove"  wrote:

Hi,
I think Chris' solution is a bit overly complicated, though I haven't
tested my alternative.  If each atom in the ring is tested for
'[$(a);!$(n1(C)ccc(=O)nc1=O)]', as you'd get if you expanded out the vector
bindings I provided previously, then I don't think you need to provide the
SMARTS for the excluded ring starting from each atom.  So long as 1 of the
atoms in the ring fails the test, the whole ring fails, so you just need
the same test on each atom.
Dave


On Sun, Sep 24, 2017 at 4:45 PM, Chris Earnshaw 
wrote:

> Hi Jim
>
> The key thing to remember about the recursive SMARTS clauses is that
> they only match one atom (the first), and the rest of the string
> describes the environment in which that atom is located. So the clause
> $(n1(C)ccc(=O)nc1=O) matches just the nitrogen atom - which has
> embedded in the rest of the ring system. We then negate that with the
> ! symbol.
>
> If we use just the recursive SMARTS expression '[$(a)]' (or the simple
> SMARTS 'a'), it can match any of the six aromatic atoms in the
> heterocycle. Adding the first exclusion '[$(a);!$(n1(C)ccc(=O)nc1=O)]'
> means this atom can't match the nitrogen substituted by aliphatic
> C,but it can still match any of the other five aromatic atoms.
> Consequently there are five more exclusion clauses to add, each of
> which starts with a different one of the aromatic atoms in your
> undesired structure. As long as one of the atoms in the full SMARTS is
> prevented from matching any of the atoms in the undesired structure in
> this way, then the overall match is prevented.
>
> Adding an exclusion for pyridine is then easy. We're already excluding
> six patterns, and (considering symmetry) we only need to add four more
> to exclude all pyridines. Appending
> ';!$(n1c1);!$(c1n1);!$(c1cnccc1);!$(c1ccncc1)' inside the
> square brackets should do the trick.
>
> You're quite right though, this gets pretty cumbersome very quickly
> and it may well be best to handle it in code with simple include /
> exclude SMARTS patterns. You'll have to think about checking which
> atoms have been matched - for example, do you want to match quinoline
> because it contains a benzene ring, or exclude it because it contains
> a pyridine? If the former you'll have to check that the atoms matched
> by your two patterns are different.
>
> Hope this helps!
>
> Chris Earnshaw
>
> On 24 September 2017 at 15:01, James T. Metz  wrote:
> > Chris,
> >
> > Wow! Your recursive SMARTS expression works as needed!
> >
> > Hmmm... Help me understand this better ... it looks like you "walk
> around"
> > the
> > ring of the substructure we want to exclude and employ a slightly
> different
> > recursive SMARTS beginning at that atom.  Is that correct?
> >
> > Also, since my situation is likely to get more complicated with respect
> to
> > exclusions, suppose I still wanted to utilize the general aromatic
> > expression
> > for a 6-membered ring i.e. [a]1:[a]:[a]:[a][a]:[a]1, and I wanted to
> exclude
> > the structures we have been discussing, and I also wanted to exclude
> > pyridine i.e., [n]1:[c]:[c]:[c]:[c]:[c]1.
> >
> > Is there a SMARTS expression that would capture 2 exclusions?
> >
> > Perhaps this is getting too clumsy!  It might be better to have one or
> more
> > inclusion SMARTS and one or more exclusion SMARTS, and write code
> > to remove those groups of atoms that are coming from the exclusion
> SMARTS.
> >
> > Any ideas for PYTHON/RDkit code?  Something like
> >
> > test_smiles = 'c1c1'
> > inclusion_pattern = '[a]1:[a]:[a]:[a]:[a]:[a]1'
> > exclusion_pattern = '[n]1:[c]:[c]:[c]:[c]:[c]1'
> > etc...
> >
> > Hmmm... any other ideas, suggestions, comments?
> >
> > Thanks again.
> >
> > Regards,
> > Jim Metz
> >
> >
> >
> >
> > -Original Message-
> > From: Chris Earnshaw 
> > To: James T. Metz 
> > Cc: Rdkit-discuss@lists.sourceforge.net
> > 
> > Sent: Sun, Sep 24, 2017 4:01 am
> > Subject: Re: [Rdkit-discuss] need SMARTS query with a specific exclusion
> >
> > Hi Jim
> >
> > It can be done with recursive SMARTS, though the syntax is a bit
> > painful Th

Re: [Rdkit-discuss] nitrogen valence issues

2017-10-05 Thread Chris Earnshaw
Hi

Be aware that there is a problem with one of the azide groups in
CHEMBL592333 - in SMILES it's '-N=[NH+]-[NH-]' rather than '-N=[N+]=[N-].
This doesn't render the structure chemically invalid but it's probably
wrong.

What's the provenance of your SD file? It isn't the same as as a fresh
download of this structure from CHEMBL, which can be processed by RDkit
quite happily (allowing for the structure being wrong!). Is it possible
that your file has got corrupted by some other processing step?

Regards,
Chris

On 5 October 2017 at 03:28, Greg Landrum  wrote:

> Hi Brian,
>
> When you pasted that into the email the formatting of the mol block did
> end up screwed up, which makes this hard to reproduce.
> Could you please attach the mol block to the message as a file?
>
> -greg
>
> On Thu, Oct 5, 2017 at 2:21 AM, Bennion, Brian  wrote:
>
>> Hello,
>>
>> After looking at the email list and seeing that this error has cropped up
>> several times for aromatic/aliphatic heterocyclic nitrogens I still haven’t
>> been able to resolve the valence =4 error for one of the azo groups in a
>> molecule that has 7.  The first couple of azo groups seem to be interpreted
>> fine.
>>
>> Am I doing something incorrect here or is the mol file not formatted
>> properly?
>>
>> Thanks
>>
>> Brian
>>
>>
>>
>>
>>
>> [16:50:29] Explicit valence for atom # 85 N, 4, is greater than permitted
>>
>> [16:50:29] ERROR: Could not sanitize molecule ending on line 206
>>
>> [16:50:29] ERROR: Explicit valence for atom # 85 N, 4, is greater than
>> permitted
>>
>>
>>
>> CHEMBL592333
>>
>>3D
>>
>>
>>
>> 91 98  0  0  0  0  0  0  0  0999 V2000
>>
>> 8.3826   -4.17890. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 7.6967   -2.89680. O   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 8.5551   -1.59260. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 9.9817   -1.64490. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>10.7075   -3.00510. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 9.8956   -4.25770. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 9.51452.28820. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 9.87980.82840. O   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>11.31180.39830. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>12.39051.38200. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>12.01872.92050. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>10.62723.32730. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 9.82044.38660. O   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 7.94905.47790. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 8.51794.08830. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 4.38225.05110. O   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 7.66732.88200. O   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 6.41505.65410. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 5.56804.42640. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 6.12623.08490. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>11.0958   -0.91180. O   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 4.5054   -4.06120. O   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 6.0427   -3.93760. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 6.8866   -5.11860. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 6.3028   -6.49020. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 4.7637   -6.64580. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 3.9029   -5.41820. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 2.7159   -5.98310. O   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 0.6356   -5.60260. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 1.9599   -4.88510. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 3.13313.0. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 3.07344.82220. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 1.82162.53730. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>-2.1064   -0.88350. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>-1.65800.62550. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>-1.8630   -3.03920. O   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>-1.1133   -1.95280. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 0.52943.27430. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 1.79765.54000. O   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 0.47904.74180. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>-0.19400.91290. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>>-0.28002.22970. O   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 0.3560   -1.63640. C   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 0.7840   -0.17610. O   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> 

Re: [Rdkit-discuss] nitrogen valence issues

2017-10-05 Thread Chris Earnshaw
Hi

Some interesting differences in behaviour compared with my RDkit
installation. Using the ChEMBL SMILES (freshly downloaded now) -

[NH-][NH+]=NC[C@H]1O[C@@H]2O[C@@H]3[C@@H](CN=[N+]=[N-])O[C@H](O[C@@H]4[C@@H](CN=[N+]=[N-])O[C@H](O[C@@H]5[C@@H](CN=[N+]=[N-])O[C@H](O[C@@H]6[C@@H](CN=[N+]=[N-])O[C@H](O[C@@H]7[C@@H](CN=[N+]=[N-])O[C@H](O[C@@H]8[C@@H](CN=[N+]=[N-])O[C@H](O[C@H]1[C@H](O)[C@H]2O)[C@H](O)[C@H]8O)[C@H](O)[C@H]7O)[C@H](O)[C@H]6O)[C@H](O)[C@H]5O)[C@H](O)[C@H]4O)[C@H](O)[C@H]3O

The problem atoms are the first two. If I convert this to an SD file
(using a C++ program based on the RDkit libraries) then the resulting
SD file contains no charge information in the atom block, it's all in
M CHG records (which are correct) and the problem atoms are still the
first two.

The bonding information is incorrect as there's a single bond between
the two nitrogens and it should be double. Editing the first record in
the bond block from -
  1  2  1  0
to -
  1  2  2  0
fixes the structure for me, and the resulting SD file can be processed
by other RDkit programs. I've attached the resulting file in case it
helps throw any light on what's happening.

I'm puzzled as to why the behaviour is significantly different for you...

Chris

On 5 October 2017 at 15:40, Bennion, Brian  wrote:
> The sdf is an rdkit reading of the original smiles string, which if wrong
> would explain the funky charge settings in the mol block for atoms 84 and
> 85.  I modified these to 5 and 3 respectively to make the correct charge
> states, however, that did not resolve the issue.  Perhaps the bonding info
> is also incorrect.  The file is on a remote server so I will repost with
> attachment if I continue to have problems.
>
> Brian
>
>
> 
> From: Chris Earnshaw 
> Sent: Thursday, October 5, 2017 12:04:02 AM
> To: Bennion, Brian; RDKit Discuss (rdkit-discuss@lists.sourceforge.net)
> Subject: Re: [Rdkit-discuss] nitrogen valence issues
>
> Hi
>
> Be aware that there is a problem with one of the azide groups in
> CHEMBL592333 - in SMILES it's '-N=[NH+]-[NH-]' rather than '-N=[N+]=[N-].
> This doesn't render the structure chemically invalid but it's probably
> wrong.
>
> What's the provenance of your SD file? It isn't the same as as a fresh
> download of this structure from CHEMBL, which can be processed by RDkit
> quite happily (allowing for the structure being wrong!). Is it possible that
> your file has got corrupted by some other processing step?
>
> Regards,
> Chris
>
> On 5 October 2017 at 03:28, Greg Landrum  wrote:
>>
>> Hi Brian,
>>
>> When you pasted that into the email the formatting of the mol block did
>> end up screwed up, which makes this hard to reproduce.
>> Could you please attach the mol block to the message as a file?
>>
>> -greg
>>
>> On Thu, Oct 5, 2017 at 2:21 AM, Bennion, Brian  wrote:
>>>
>>> Hello,
>>>
>>> After looking at the email list and seeing that this error has cropped up
>>> several times for aromatic/aliphatic heterocyclic nitrogens I still haven’t
>>> been able to resolve the valence =4 error for one of the azo groups in a
>>> molecule that has 7.  The first couple of azo groups seem to be interpreted
>>> fine.
>>>
>>> Am I doing something incorrect here or is the mol file not formatted
>>> properly?
>>>
>>> Thanks
>>>
>>> Brian
>>>
>>>
>>>
>>>
>>>
>>> [16:50:29] Explicit valence for atom # 85 N, 4, is greater than permitted
>>>
>>> [16:50:29] ERROR: Could not sanitize molecule ending on line 206
>>>
>>> [16:50:29] ERROR: Explicit valence for atom # 85 N, 4, is greater than
>>> permitted
>>>
>>>
>>>
>>> CHEMBL592333
>>>
>>>3D
>>>
>>>
>>>
>>> 91 98  0  0  0  0  0  0  0  0999 V2000
>>>
>>> 8.3826   -4.17890. C   0  0  0  0  0  0  0  0  0  0  0  0
>>>
>>> 7.6967   -2.89680. O   0  0  0  0  0  0  0  0  0  0  0  0
>>>
>>> 8.5551   -1.59260. C   0  0  0  0  0  0  0  0  0  0  0  0
>>>
>>> 9.9817   -1.64490. C   0  0  0  0  0  0  0  0  0  0  0  0
>>>
>>>10.7075   -3.00510. C   0  0  0  0  0  0  0  0  0  0  0  0
>>>
>>> 9.8956   -4.25770. C   0  0  0  0  0  0  0  0  0  0  0  0
>>>
>>> 9.51452.28820. C   0  0  0  0  0  0  0  0  0  0  0  0
>>>
>>> 9.87980.82840. O   0  0  0  0  0  0  0  0  0  0 

Re: [Rdkit-discuss] nitrogen valence issues

2017-10-07 Thread Chris Earnshaw
As the problem only arises with a structure processed and written by MOE,
it looks like that's the cause of the difficulty. The input structure is
chemically valid (if structurally incorrect) and can be processed by RDkit
without any problems.

I've used RDkit reaction transformations in the past to fix this kind of
structure error before passing them to subsequent processing.

Chris


On 6 Oct 2017 23:32, "Bennion, Brian"  wrote:

> The rdkit sdf files were washed with MOE and then read into rdkit
> again for 3D structure generation.
> As there are only a dozen problem cases out of 1.5 million compounds, I
> just removed them from my main file and downloaded the mol files from
> chembl and double check the structures.
>
> Bran
>
> -----Original Message-
> From: Chris Earnshaw [mailto:cgearns...@gmail.com]
> Sent: Thursday, October 05, 2017 08:46
> To: Bennion, Brian 
> Cc: RDKit Discuss (rdkit-discuss@lists.sourceforge.net) <
> rdkit-discuss@lists.sourceforge.net>
> Subject: Re: [Rdkit-discuss] nitrogen valence issues
>
> Hi
>
> Some interesting differences in behaviour compared with my RDkit
> installation. Using the ChEMBL SMILES (freshly downloaded now) -
>
> [NH-][NH+]=NC[C@H]1O[C@@H]2O[C@@H]3[C@@H](CN=[N+]=[N-])O[C@H](O[C@@H]4[C@
> @H](CN=[N+]=[N-])O[C@H](O[C@@H]5[C@@H](CN=[N+]=[N-])O[C@H](O[C@@H]6[C@@H](
> CN=[N+]=[N-])O[C@H](O[C@@H]7[C@@H](CN=[N+]=[N-])O[C@H](O[C@@H]8[C@
> @H](CN=[N+]=[N-])O[C@H](O[C@H]1[C@H](O)[C@H]2O)[C@H](O)[C@H]8O)[C@H
> ](O)[C@H]7O)[C@H](O)[C@H]6O)[C@H](O)[C@H]5O)[C@H](O)[C@H]4O)[C@H](O)[C@H
> ]3O
>
> The problem atoms are the first two. If I convert this to an SD file
> (using a C++ program based on the RDkit libraries) then the resulting SD
> file contains no charge information in the atom block, it's all in M CHG
> records (which are correct) and the problem atoms are still the first two.
>
> The bonding information is incorrect as there's a single bond between the
> two nitrogens and it should be double. Editing the first record in the bond
> block from -
>   1  2  1  0
> to -
>   1  2  2  0
> fixes the structure for me, and the resulting SD file can be processed by
> other RDkit programs. I've attached the resulting file in case it helps
> throw any light on what's happening.
>
> I'm puzzled as to why the behaviour is significantly different for you...
>
> Chris
>
> On 5 October 2017 at 15:40, Bennion, Brian  wrote:
> > The sdf is an rdkit reading of the original smiles string, which if
> > wrong would explain the funky charge settings in the mol block for
> > atoms 84 and 85.  I modified these to 5 and 3 respectively to make the
> > correct charge states, however, that did not resolve the issue.
> > Perhaps the bonding info is also incorrect.  The file is on a remote
> > server so I will repost with attachment if I continue to have problems.
> >
> > Brian
> >
> >
> > 
> > From: Chris Earnshaw 
> > Sent: Thursday, October 5, 2017 12:04:02 AM
> > To: Bennion, Brian; RDKit Discuss
> > (rdkit-discuss@lists.sourceforge.net)
> > Subject: Re: [Rdkit-discuss] nitrogen valence issues
> >
> > Hi
> >
> > Be aware that there is a problem with one of the azide groups in
> > CHEMBL592333 - in SMILES it's '-N=[NH+]-[NH-]' rather than '-N=[N+]=[N-].
> > This doesn't render the structure chemically invalid but it's probably
> > wrong.
> >
> > What's the provenance of your SD file? It isn't the same as as a fresh
> > download of this structure from CHEMBL, which can be processed by
> > RDkit quite happily (allowing for the structure being wrong!). Is it
> > possible that your file has got corrupted by some other processing step?
> >
> > Regards,
> > Chris
> >
> > On 5 October 2017 at 03:28, Greg Landrum  wrote:
> >>
> >> Hi Brian,
> >>
> >> When you pasted that into the email the formatting of the mol block
> >> did end up screwed up, which makes this hard to reproduce.
> >> Could you please attach the mol block to the message as a file?
> >>
> >> -greg
> >>
> >> On Thu, Oct 5, 2017 at 2:21 AM, Bennion, Brian 
> wrote:
> >>>
> >>> Hello,
> >>>
> >>> After looking at the email list and seeing that this error has
> >>> cropped up several times for aromatic/aliphatic heterocyclic
> >>> nitrogens I still haven’t been able to resolve the valence =4 error
> >>> for one of the azo groups in a molecule that has 7.  The fir

Re: [Rdkit-discuss] Reaction changing bonds but not charges

2017-10-09 Thread Chris Earnshaw
Hi

I don't think the reaction SMARTS does specify a change in the
charges. The N and O have to be charged in order to match the reactant
pattern but they won't be altered by the transformation. If you
specify that the product atoms are explicitly neutral I think you'll
get the result you want -
[#8-:2]-[#7+:1]=[O:3]>>[O+0:2]=[N+0:1]=[O:3]

Chris Earnshaw

On 9 October 2017 at 15:57, Chris Murphy  wrote:
> Hi,
>
> I am using rdChemReactions to perform substructure transformations as
> defined by configurable reaction smarts. When I create the reaction and run
> a mol through it that I expect to be transformed by the indicated reaction
> smarts, I see that the bonds have been changed according to the smarts, but
> the charges are not changed.
>
> I am using the following smarts to define the transformation:
> [#8-:2]-[#7+:1]=[O:3]>>[O:2]=[N:1]=[O:3]
>
> The input molecule is as follows:
>
>   Mrv16c5 10061718252D
>
>   9  9  0  0  0  0999 V2000
>-0.5916   -0.06780. C   0  0  0  0  0  0  0  0  0  0  0  0
>-1.3060   -0.48030. C   0  0  0  0  0  0  0  0  0  0  0  0
>-1.3060   -1.30540. C   0  0  0  0  0  0  0  0  0  0  0  0
>-0.5916   -1.71790. C   0  0  0  0  0  0  0  0  0  0  0  0
> 0.1229   -1.30540. C   0  0  0  0  0  0  0  0  0  0  0  0
> 0.1229   -0.48030. C   0  0  0  0  0  0  0  0  0  0  0  0
>-0.59160.75720. N   0  3  0  0  0  0  0  0  0  0  0  0
>-1.30611.16970. O   0  5  0  0  0  0  0  0  0  0  0  0
> 0.12291.16970. O   0  0  0  0  0  0  0  0  0  0  0  0
>   1  2  2  0  0  0  0
>   2  3  1  0  0  0  0
>   3  4  2  0  0  0  0
>   4  5  1  0  0  0  0
>   5  6  2  0  0  0  0
>   6  1  1  0  0  0  0
>   1  7  1  0  0  0  0
>   7  8  1  0  0  0  0
>   7  9  2  0  0  0  0
> M  END
> 
>
> and the output ends up being:
>
>  RDKit  2D
>
>   9  9  0  0  0  0  0  0  0  0999 V2000
>-1.30611.16970. O   0  0  0  0  0  0  0  0  0  0  0  0
>-0.59160.75720. N   0  0  0  0  0  0  0  0  0  0  0  0
> 0.12291.16970. O   0  0  0  0  0  0  0  0  0  0  0  0
>-0.5916   -0.06780. C   0  0  0  0  0  0  0  0  0  0  0  0
>-1.3060   -0.48030. C   0  0  0  0  0  0  0  0  0  0  0  0
> 0.1229   -0.48030. C   0  0  0  0  0  0  0  0  0  0  0  0
>-1.3060   -1.30540. C   0  0  0  0  0  0  0  0  0  0  0  0
> 0.1229   -1.30540. C   0  0  0  0  0  0  0  0  0  0  0  0
>-0.5916   -1.71790. C   0  0  0  0  0  0  0  0  0  0  0  0
>   1  2  2  0
>   2  3  2  0
>   4  2  1  0
>   4  5  2  0
>   6  4  1  0
>   5  7  1  0
>   8  6  2  0
>   7  9  2  0
>   9  8  1  0
> M  CHG  2   1  -1   2   1
> M  END
>
> It seems the that NO bond is successfully converted to a double bond, but
> the charges have not been changed, even though the reaction smarts indicates
> that they N and O should be changed to neutral. Are there any glaring issues
> with using reactions to do transformations like this?
>
> Thanks!
>
> -Chris
>
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

2017-11-08 Thread Chris Earnshaw
Hi

Surely the problem is that some of these SMILES aren't really valid. From
the Daylight theory manual: '*The bonds are numbered in any order,
designating ring opening (or ring closure) bonds by a digit immediately
following the atomic symbol at each ring closure'*  (my emphasis).

So the behaviour with SMILES where there is an atom between the ring
closure digit and the atom to which the ring closure applies (e.g.
[C@@](F)1(C)CCO1)
may well not be well defined. Arguably RDKit should refuse to process
these, but apparently it looks at the atom order and inverts the
stereochemistry instead. In Daylight SMILES the @ symbol refers to the
order of substituents around the asymmetric atom. If we swap the ring
closure digit and one of the atoms then we've changed the order of
connections and inverted the stereochemistry, so the current behaviour
seems reasonable. Personally I wouldn't change the behaviour - or get RDKit
to issue a warning that the SMILES isn't 'strict' in these cases.

I think the safest approach is to stick to SMILES which are unequivocally
valid, unless RDKit is going to create its own definition of SMILES...


Best regards,
Chris Earnshaw

On 9 November 2017 at 07:13, Greg Landrum  wrote:

>
> On Thu, Nov 9, 2017 at 6:32 AM, Brian Cole  wrote:
>
>> Hi Cheminformaticians,
>>
>> This is an extreme subtlety in the interpretation of SMILES atom
>> stereochemistry and I think a bug in RDKit. Specifically, I think the
>> following SMILES should be the same molecule:
>>
>> >>> rdkit.__version__
>> '2017.09.1'
>> >>> Chem.CanonSmiles('F[C@@]1(C)CCO1')
>> 'C[C@]1(F)CCO1'
>> >>> Chem.CanonSmiles('[C@@](F)1(C)CCO1')
>> 'C[C@@]1(F)CCO1'
>>
>
> As was discussed in the comments of https://github.com/rdkit/
> rdkit/issues/786, I think it's pretty gross that the second syntax is
> even legal. But that's a side point.
>
> Since there is no hydrogen inside the stereo carbon atom block the bond
>> being 'looked down' should be the first atom encountered. In both cases
>> above, that should be the Florine, therefore the molecules should be
>> equivalent.
>>
>
> Agreed, and this is a view that's further supported by this behavior:
>
> In [2]: Chem.CanonSmiles('F[C@@]1(C)CCO1')
> Out[2]: 'C[C@]1(F)CCO1'
>
> In [3]: Chem.CanonSmiles('F[C@@](C)1CCO1')
> Out[3]: 'C[C@@]1(F)CCO1'
>
> Would you mind filing a bug for this and I'll try to track it down/fix it?
>
> Thanks,
> -greg
>
>
>
>>
>> Though it could be argued the 2nd one is not strict SMILES as Andrew
>> describes here: https://github.com/rdkit/rdkit/issues/786
>>
>> It is useful when recombining fragments with ring closure digits for
>> these to be equivalent:
>> [*][C@]1(C)CCO1
>> [C@]([*])1(C)CCO1
>>
>> Also, every other tool I can get my hands on agrees they're the same:
>> OEChem, OpenBabel, indigo, and ChemAxon. (CDK lacks a simple enough
>> canonicalization example for me to work from.)
>>
>> Sure wish there was a SMILES validation test suite we could all run
>> against. And so I'm attaching the examples I used to verify the above so
>> whatever poor soul assigned that task later can find this on Google. (I'm
>> hopeful :-)
>>
>> Thanks,
>> Brian
>>
>> PS: the current output from the script:
>>
>> $ python stereo_handling_first_atom.py
>> RDKit = 2017.09.1
>> OEChem = 2.1.2
>> OpenBabel = 2.4.1
>> indigo = 1.2.3.r0-g98188eb mac10.7
>> RDKit failed to recognize these as the same:
>> [*:1][C@]1([*:2])CC1(Cl)Cl -> ClC1(Cl)C[C@]1([*:1])[*:2]
>> [C@]([*:1])1([*:2])CC1(Cl)Cl -> ClC1(Cl)C[C@@]1([*:1])[*:2]
>> OpenBabel failed to recognize these as the same:
>> Cl[S@](C)=O -> C[S@](=O)Cl
>> [S@](Cl)(C)=O -> C[S@@](=O)Cl
>> Indigo failed to recognize these as the same:
>> Cl[S@](C)=O -> C[S@](=O)Cl
>> [S@](Cl)(C)=O -> C[S@@](=O)Cl
>> OpenBabel failed to recognize these as the same:
>> Cl[S@](C)= -> =[S@](Cl)C
>> [S@](Cl)(C)= -> =[S@@](Cl)C
>> Indigo failed to recognize these as the same:
>> Cl[S@](C)= -> =[S@@](C)Cl
>> [S@](Cl)(C)= -> =[S@](C)Cl
>> RDKit failed to recognize these as the same:
>> Cl[C@](F)1CC[C@H](F)CC1 -> F[C@H]1CC[C@](F)(Cl)CC1
>> [C@](Cl)(F)1CC[C@H](F)CC1 -> F[C@H]1CC[C@@](F)(Cl)CC1
>> RDKit failed to recognize these as the same:
>> Cl[C@]1(c2c2)NCCCS1 -> Cl[C@]1(c2c2)NCCCS1
>> [C@](Cl)1(c2c2)NCCCS1

Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

2017-11-09 Thread Chris Earnshaw
Trouble is, you're mixing chemical operations and lexical ones. It
might be handy if this 'just worked' but in practice it's not going to
produce valid SMILES without more work.

I've written code in the past to do this kind of thing for virtual
library building, using dummy atoms to mark link positions in the
fragments, and using Perl code to transform between the dummy atoms
and bond-closure numbers to give text strings which could be assembled
to give valid dot-disconnected SMILES. This required additional
lexical transformations in order to maintain valid SMILES depending on
where the dummy atom was, and to make sure that stereochemistry worked
properly. If you want to do this kind of thing I don't think you can
expect to avoid these additional lexical operations.

I don't think it's reasonable to expect that invalid SMILES strings
should be coerced into giving a particular result for convenience when
1) - they're invalid! and 2) - the behaviour is actually a reasonable
interpretation of the order of connections in the SMILES (even though
they are invalid).

I don't think the current RDKit interpretation of these SMILES should
change, though it might be useful if it could issue a warning that
SMILES of this type are not correct.

Best regards,
Chris

On 9 November 2017 at 15:09, Brian Cole  wrote:
> Here's an example of why this is useful at maintaining molecular
> fragmentation inside your molecular representation:
>
 from rdkit import Chem
 smiles = 'F9.[C@]91(C)CCO1'
 fluorine, core = smiles.split('.')
 fluorine
> 'F9'
 fragment = core.replace('9', '([*:9])')
 fragment
> '[C@]([*:9])1(C)CCO1'
 mol = Chem.RWMol(Chem.MolFromSmiles(fragment))  ### RDKit is flipping
 the stereo on me here even the order of the bonds has not changed
 idx = mol.AddAtom(Chem.Atom(0))
 mol.AddBond(idx, 4, Chem.rdchem.BondType.SINGLE)
> 7
 mol.GetAtomWithIdx(idx).SetIntProp("molAtomMapNumber", 8)
 new_core = Chem.MolToSmiles(mol, True)
 new_core = new_core.replace('([*:9])', '9').replace('([*:8])', '8')
 new_core
> 'C[C@]19CC8O1'
 analog_smiles = 'Cl8.' + fluorine + '.' + new_core
 analog_smiles
> 'Cl8.F9.C[C@]19CC8O1'
 analog = Chem.MolFromSmiles(analog_smiles)
 analog.HasSubstructMatch(Chem.MolFromSmiles(smiles), useChirality=True)
 # Uh oh! My original molecule didn't match
> False
 analog.HasSubstructMatch(Chem.MolFromSmiles(smiles.replace('@', '@@')),
 useChirality=True)   # flipping the stereo of the original causes it to
 match again
> True
>
>
>
>
> On Thu, Nov 9, 2017 at 4:41 AM, Andrew Dalke 
> wrote:
>>
>> On Nov 9, 2017, at 08:13, Greg Landrum  wrote:
>> > As was discussed in the comments of
>> > https://github.com/rdkit/rdkit/issues/786, I think it's pretty gross that
>> > the second syntax is even legal. But that's a side point.
>>
>> To belabor that point. Neither Daylight SMILES nor OpenSMILES accept it,
>> which are the only two explicit sources of "legal" that people use.
>>
>> "allowed" might be a better term.
>>
>> Andrew
>> da...@dalkescientific.com
>>
>>
>>
>>
>> --
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
>
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Having trouble getting RDKIT to recognize LiAsF6

2017-11-21 Thread Chris Earnshaw
Hi

A quick follow-up to my last message, w.r.t. halogens. The issue is
not that higher oxidation state halogen ions can't be constructed at
all, but they end up with bizarre charges on the halogen which are
frequently not recognised by other software.

Chris

On 21 November 2017 at 08:22, Chris Earnshaw  wrote:
> Hi
>
> The entries for P and As in RDKit's atomic_data.cpp are -
> 15  P   0.750.892.0830.974  5   31
> 30.97376163 3   5   7
> 33  As  1.211.2 1.8574.922  5   75
> 74.9215965   3   5
>
> So the required 'connectivities' for PF6- are present in the file
> (final 3 values, '7' being the relevant one here), but not for As. I'm
> not quite sure why this should be so as the phosphorus is P(V) in PF6,
> but this appears to be the cause. You'll have to add the '7' in the As
> entry and rebuild RDKit to allow AsF6- to be processed. A similar fix
> would be needed for antimony too. Someone please let me know if
> there's a more convenient solution!
>
> There are similar issues for the halogens as well. Only iodine has
> values > 1, so by default it's not possible to construct e.g.
> chlorates, or bromates, and no perhalates are allowed.
>
> Regards,
> Chris Earnshaw
>
> On 20 November 2017 at 23:03, Yoolhee Kim  wrote:
>> Hello,
>>
>> I'm trying to get RDKIT to recognize LiAsF6 from the SMILES Formula
>> "[Li+].F[As-](F)(F)(F)(F)F" but it seems that RDKIT will not read it
>> correctly and is unable to create a molecule object.
>>
>> RDKIT is able to understand LiPF6 from the SMILES formula
>> "[Li+].F[P-](F)(F)(F)(F)F" so I tried using that and replacing P with As
>> using both replaceatom and replacesubstruct functions, but that had some
>> issues as well.
>>
>> Has anyone faced a similar issue or have any suggestions?
>>
>> Thanks!
>>
>> -Yoolhee
>>
>> --
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Hypervalent halogen structures - chlorate etc.

2017-11-21 Thread Chris Earnshaw
Hi

Sometime between 2014 and now there appears to have been a change in
the way hypervalent halogen structures are handled. The old behaviour
(involving some tweaking of atomic_data.cpp to allow the higher
oxidation states) was to have a neutral halogen with double bonds to
most of the oxygens, e.g.
chlorate O=Cl(=O)[O-]
perchlorate O=Cl(=O)(=O)[O-]

The current behaviour is to 'charge separate' the dative bonds, giving
chlorate [O-][Cl2+]([O+])[O-]
perchlorate [O-][Cl3+]([O-])([O+])[O-]

Although this may be regarded as 'correct' (arguable!), it  can cause
problems of compatibility with other software and looks remarkably
ugly. It's also inconsistent with the handling of hypervalent P and S
compounds. Using the same convention, we should have -

trimethylphosphine oxide C[P+]([O-])(C)C
dimethylsulfoxide C[S+]([O-])C
dimethylsulfone C[S2+]([O-])([O-])C

- and I really don't want this to happen!

Does anyone know a way to restore the old behaviour for chlorites,
bromates, periodates etc.?

Best regards,
Chris Earnshaw

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Hypervalent halogen structures - chlorate etc.

2017-11-21 Thread Chris Earnshaw
Hi

Oops - typo! The chlorate and perchlorate oxygens should all be [O-].
I was copying by hand from a different computer...

Regards,
Chris

On 21 November 2017 at 09:19, Paolo Tosco  wrote:
> Hi Chris,
>
> if the behaviour with chlorate and perchlorate is the one you report
>
> chlorate [O-][Cl2+]([O+])[O-]
> perchlorate [O-][Cl3+]([O-])([O+])[O-]
>
> it looks wrong to me as there is an overall formal charge of +1. All O's
> should bear a -1 charge.
>
> Cheers,
> p.
>
>
>
> On 11/21/17 09:12, Chris Earnshaw wrote:
>>
>> Hi
>>
>> Sometime between 2014 and now there appears to have been a change in
>> the way hypervalent halogen structures are handled. The old behaviour
>> (involving some tweaking of atomic_data.cpp to allow the higher
>> oxidation states) was to have a neutral halogen with double bonds to
>> most of the oxygens, e.g.
>> chlorate O=Cl(=O)[O-]
>> perchlorate O=Cl(=O)(=O)[O-]
>>
>> The current behaviour is to 'charge separate' the dative bonds, giving
>> chlorate [O-][Cl2+]([O+])[O-]
>> perchlorate [O-][Cl3+]([O-])([O+])[O-]
>>
>> Although this may be regarded as 'correct' (arguable!), it  can cause
>> problems of compatibility with other software and looks remarkably
>> ugly. It's also inconsistent with the handling of hypervalent P and S
>> compounds. Using the same convention, we should have -
>>
>> trimethylphosphine oxide C[P+]([O-])(C)C
>> dimethylsulfoxide C[S+]([O-])C
>> dimethylsulfone C[S2+]([O-])([O-])C
>>
>> - and I really don't want this to happen!
>>
>> Does anyone know a way to restore the old behaviour for chlorites,
>> bromates, periodates etc.?
>>
>> Best regards,
>> Chris Earnshaw
>>
>>
>> --
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Hypervalent halogen structures - chlorate etc.

2017-11-22 Thread Chris Earnshaw
Thanks Greg, I suspected that might be the case. I certainly don't
want to turn off all sanitization, but I could write a bit of code to
convert these salts back to my desired form as a separate step.

Alternatively, I guess I could disable or rewrite halogenCleanup() in
MolOps.cpp. Is this the only function which would need to be changed?

Best regards,
Chris

On 22 November 2017 at 07:59, Greg Landrum  wrote:
> At the moment the only way to do this is to disable the "cleanup"
> functionality in SanitizeMol(). This can be done, but it will also have the
> consequence that things like the hypervalent N in nitro groups (i.e.
> "-N(=O)=O") is not cleaned up.
>
> In [2]: m = Chem.MolFromSmiles('O=Cl(=O)(=O)[O-]',sanitize=False)
>
> In [3]: m.UpdatePropertyCache(strict=False)
>
> In [4]:
> Chem.SanitizeMol(m,sanitizeOps=Chem.SANITIZE_ALL^Chem.SANITIZE_CLEANUP^Chem.SANITIZE_PROPERTIES)
> Out[4]: rdkit.Chem.rdmolops.SanitizeFlags.SANITIZE_NONE
>
> In [5]: Chem.MolToSmiles(m)
> Out[5]: 'O=Cl(=O)(=O)[O-]'
>
> In [6]: m2 = Chem.MolFromSmiles('CN(=O)=O',sanitize=False)
>
> In [7]: m2.UpdatePropertyCache(strict=False)
>
> In [8]:
> Chem.SanitizeMol(m2,sanitizeOps=Chem.SANITIZE_ALL^Chem.SANITIZE_CLEANUP^Chem.SANITIZE_PROPERTIES)
> Out[8]: rdkit.Chem.rdmolops.SanitizeFlags.SANITIZE_NONE
>
> In [10]: Chem.MolToSmiles(m2)
> Out[10]: 'CN(=O)=O'
>
>
> Note that this way of doing things disables all "unreasonable" valence
> checking.
>
> -greg
>
>
> On Tue, Nov 21, 2017 at 10:12 AM, Chris Earnshaw 
> wrote:
>>
>> Hi
>>
>> Sometime between 2014 and now there appears to have been a change in
>> the way hypervalent halogen structures are handled. The old behaviour
>> (involving some tweaking of atomic_data.cpp to allow the higher
>> oxidation states) was to have a neutral halogen with double bonds to
>> most of the oxygens, e.g.
>> chlorate O=Cl(=O)[O-]
>> perchlorate O=Cl(=O)(=O)[O-]
>>
>> The current behaviour is to 'charge separate' the dative bonds, giving
>> chlorate [O-][Cl2+]([O+])[O-]
>> perchlorate [O-][Cl3+]([O-])([O+])[O-]
>>
>> Although this may be regarded as 'correct' (arguable!), it  can cause
>> problems of compatibility with other software and looks remarkably
>> ugly. It's also inconsistent with the handling of hypervalent P and S
>> compounds. Using the same convention, we should have -
>>
>> trimethylphosphine oxide C[P+]([O-])(C)C
>> dimethylsulfoxide C[S+]([O-])C
>> dimethylsulfone C[S2+]([O-])([O-])C
>>
>> - and I really don't want this to happen!
>>
>> Does anyone know a way to restore the old behaviour for chlorites,
>> bromates, periodates etc.?
>>
>> Best regards,
>> Chris Earnshaw
>>
>>
>> --
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Hypervalent 2nd row element (and higher) representation / sanitization

2017-11-23 Thread Chris Earnshaw
Following a recent brief discussion about hypervalent halogen salt handing
in RDKit (chlorates, periodates etc.) I've been thinking about my
preferences for representation of hypervalent structures in general,
including more common groups like phosphorus(V) compounds, sulfoxides,
sulfones etc., as well as how they should be sanitized by RDKit

It might be useful to have a general discussion about how RDKit should
handle these systems. A 'one size fits all' solution which everyone agrees
on is, unfortunately, likely to be quite impossible.

A brief summary of my thoughts:
- we have to use the dative bond representation for nitro compounds because
N has no accessible d-orbitals, so the hypervalent -N(=O)=O representation
is 'wrong'
- P, S, Cl (and higher congeners) do have accessible d-orbitals, so
hypervalent representations for these compounds are not intrinsically
wrong, it's a matter of convention (and interoperability) whether we use
dative bond or hypervalent representations, e.g. C[S+]([O-])C or CS(=O)C
for DMSO.

My personal preference is to use hypervalent representations in the
majority of cases, e.g.
chlorate O=Cl(=O)[O-] instead of [O-][Cl+2]([O-])[O-]
periodate O=I(=O)(=O)[O-] instead of [O-][I+3]([O-])([O-])[O-]
iodosobenzene c1c1I=O instead of c1c1[I+][O-]
dimethylsulfone CS(=O)(=O)C instead of C[S+2]([O-])([O-])C
trimethylphosphine oxide CP(=O)(C)C instead of C[P+]([O-]))C)C
etc. etc.

There are also a few cases which come down purely to personal preference
and I generally use these guidelines:
- salt anions have any residual negative charge on O where possible, so
thiosulfate ends up as O=S([O-])([O-])=S rather than O=S(=O)([O-])[S-]
- carbanions adjacent to sulfonyl or phosphoryl groups have the charge on
the carbon
- sulfur and phosphorus ylids are represented as charge separated, e.g
trimethylsulfonium ylide is C[S+](C)[C-] rather than CS(C)=C.

Currently, RDKit will convert the hypervalent representation of the halogen
acids into dative bond form, leave sulfur compounds untouched, and for
phosphorus only convert the 'metaphosphate' structures [C,N]=P(C)=O to
[C,N]=[P+](C)[O-].

As an experiment, I've created a modified version of MolOps.cpp which does
all of my preferred conversions above (with the exception of moving charge
in thiosulfates from S to O if the input structure was already
hypervalent). It has changes to the functions phosphorusCleanup(),
halogenCleanup(), cleanUp() and a new function sulfurCleanup(). If anyone
is interested (and with Greg's permission), I'll share a Google drive link
to the file so others can try it out.

Note that a few tests will fail with the new MolOps.cpp:
- testMMFFForceField (does some checks on dative bond forms which
presumably now get converted)
- graphmolMolOpsTest (builds perchlorates etc. and expects the result to be
in dative bond form)
- pythonTestDirChem (not sure what's wrong with this one - I can't find
what it does!)

Apologies for the length of all this...

Chris Earnshaw
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Hypervalent 2nd row element (and higher) representation / sanitization

2017-11-23 Thread Chris Earnshaw
Hi Steve

Glad there's some interest!

At the moment the default behaviour of RDKit would be to leave the input
representation of your structures unchanged.

My modified code would change a dative bond representation of the sulfinate
to hypervalent (C[S+]([O-])[O-] -> CS(=O)[O-], but would do nothing with
the P-H phosphinic acid. Ideally it should convert C[PH+](*)[O-]  to
C[PH](*)=O but doesn't - I'll try to have a look at this.

Of course there's a tautomerization question for these phosphorus compounds
as well (I did quite a lot of work with some of them decades ago when I was
a 'real' chemist!) and you may well want to convert e.g. CP(O)(O) to
C[PH](=O)O at some point. That's a separate issue from sanitization -
probably best handled with a specific reaction transformation when you need
it.

Best regards,
Chris

On 23 November 2017 at 17:49, Stephen Roughley 
wrote:

> Chris,
>
> This is great! Maybe this should be an option in MolSanitize?  Also, a
> couple of others to watch out for - phosphinic acids (e.g. C[PH]=O ) and
> sulfinic acids (e.g. CS(=O)[OH]).
>
> I'm not sure what RDKit currently does with those, but it would be worth
> incorporating them into any solution / test set?
>
> Yours,
>
> Steve
>
> On Thu, Nov 23, 2017 at 5:27 PM, Chris Earnshaw 
> wrote:
>
>> Following a recent brief discussion about hypervalent halogen salt
>> handing in RDKit (chlorates, periodates etc.) I've been thinking about my
>> preferences for representation of hypervalent structures in general,
>> including more common groups like phosphorus(V) compounds, sulfoxides,
>> sulfones etc., as well as how they should be sanitized by RDKit
>>
>> It might be useful to have a general discussion about how RDKit should
>> handle these systems. A 'one size fits all' solution which everyone agrees
>> on is, unfortunately, likely to be quite impossible.
>>
>> A brief summary of my thoughts:
>> - we have to use the dative bond representation for nitro compounds
>> because N has no accessible d-orbitals, so the hypervalent -N(=O)=O
>> representation is 'wrong'
>> - P, S, Cl (and higher congeners) do have accessible d-orbitals, so
>> hypervalent representations for these compounds are not intrinsically
>> wrong, it's a matter of convention (and interoperability) whether we use
>> dative bond or hypervalent representations, e.g. C[S+]([O-])C or CS(=O)C
>> for DMSO.
>>
>> My personal preference is to use hypervalent representations in the
>> majority of cases, e.g.
>> chlorate O=Cl(=O)[O-] instead of [O-][Cl+2]([O-])[O-]
>> periodate O=I(=O)(=O)[O-] instead of [O-][I+3]([O-])([O-])[O-]
>> iodosobenzene c1c1I=O instead of c1c1[I+][O-]
>> dimethylsulfone CS(=O)(=O)C instead of C[S+2]([O-])([O-])C
>> trimethylphosphine oxide CP(=O)(C)C instead of C[P+]([O-]))C)C
>> etc. etc.
>>
>> There are also a few cases which come down purely to personal preference
>> and I generally use these guidelines:
>> - salt anions have any residual negative charge on O where possible, so
>> thiosulfate ends up as O=S([O-])([O-])=S rather than O=S(=O)([O-])[S-]
>> - carbanions adjacent to sulfonyl or phosphoryl groups have the charge on
>> the carbon
>> - sulfur and phosphorus ylids are represented as charge separated, e.g
>> trimethylsulfonium ylide is C[S+](C)[C-] rather than CS(C)=C.
>>
>> Currently, RDKit will convert the hypervalent representation of the
>> halogen acids into dative bond form, leave sulfur compounds untouched, and
>> for phosphorus only convert the 'metaphosphate' structures [C,N]=P(C)=O to
>> [C,N]=[P+](C)[O-].
>>
>> As an experiment, I've created a modified version of MolOps.cpp which
>> does all of my preferred conversions above (with the exception of moving
>> charge in thiosulfates from S to O if the input structure was already
>> hypervalent). It has changes to the functions phosphorusCleanup(),
>> halogenCleanup(), cleanUp() and a new function sulfurCleanup(). If anyone
>> is interested (and with Greg's permission), I'll share a Google drive link
>> to the file so others can try it out.
>>
>> Note that a few tests will fail with the new MolOps.cpp:
>> - testMMFFForceField (does some checks on dative bond forms which
>> presumably now get converted)
>> - graphmolMolOpsTest (builds perchlorates etc. and expects the result to
>> be in dative bond form)
>> - pythonTestDirChem (not sure what's wrong with this one - I can't find
>> what it does!)
>>
>> Apologies for the length of all this...
>>
>> Chris E

Re: [Rdkit-discuss] Having trouble getting RDKIT to recognize LiAsF6

2017-11-25 Thread Chris Earnshaw
Hi Yoolhee

If this is something you need fixed *now*, you have access to the RDKit
source code and you're confident to build a new version of RDKit, then try
the following.

1. from the root folder of the RDkit distribution, find the folder
Code/GraphMol/
2. in that folder, copy atomic_data.cpp to atomic_data.cpp.bak (so you can
restore things if necessary)
3. edit atomic_data.cpp as follows
  - find the line for phosphorus that starts '15   P' (probably line 42
in the file)
  - note that the line ends '3   5   7 \n \'
  - find the line for arsenic, starting '33  As' (probably line 60)
  - this line ends '3  5 \n \'
  - change the line by inserting 7 immediately after the 5 (that's a
tab character)
  - the end of the arsenic line should now look like the end of the
phosphorus line
  - save the file
4. rebuild RDKit

I wouldn't recommend trying this if you haven't previously built RDKit from
source.


@Greg - is it worth adding higher valence states for the halogens as well
while fixing As and Sb? At the moment we have
Cl  1
Br  1
I   1  2  5
Shouldn't these all be '1  3  5  7' ?

Best regards,
Chris

On 24 November 2017 at 21:31, Yoolhee Kim  wrote:

> Chris,
>
> Thank you very much for your reply! I'm not very familiar with RDKIT, and
> I was wondering if you could elaborate how to fix the problem of adding '7'
> in the As entry so that the valence state is recognized.
>
> Best regards,
> Yoolhee Kim
>
> On Tue, Nov 21, 2017 at 8:22 AM, Chris Earnshaw 
> wrote:
>
>> Hi
>>
>> The entries for P and As in RDKit's atomic_data.cpp are -
>> 15  P   0.750.892.0830.974  5   31
>> 30.97376163 3   5   7
>> 33  As  1.211.2 1.8574.922  5   75
>> 74.9215965   3   5
>>
>> So the required 'connectivities' for PF6- are present in the file
>> (final 3 values, '7' being the relevant one here), but not for As. I'm
>> not quite sure why this should be so as the phosphorus is P(V) in PF6,
>> but this appears to be the cause. You'll have to add the '7' in the As
>> entry and rebuild RDKit to allow AsF6- to be processed. A similar fix
>> would be needed for antimony too. Someone please let me know if
>> there's a more convenient solution!
>>
>> There are similar issues for the halogens as well. Only iodine has
>> values > 1, so by default it's not possible to construct e.g.
>> chlorates, or bromates, and no perhalates are allowed.
>>
>> Regards,
>> Chris Earnshaw
>>
>> On 20 November 2017 at 23:03, Yoolhee Kim 
>> wrote:
>> > Hello,
>> >
>> > I'm trying to get RDKIT to recognize LiAsF6 from the SMILES Formula
>> > "[Li+].F[As-](F)(F)(F)(F)F" but it seems that RDKIT will not read it
>> > correctly and is unable to create a molecule object.
>> >
>> > RDKIT is able to understand LiPF6 from the SMILES formula
>> > "[Li+].F[P-](F)(F)(F)(F)F" so I tried using that and replacing P with As
>> > using both replaceatom and replacesubstruct functions, but that had some
>> > issues as well.
>> >
>> > Has anyone faced a similar issue or have any suggestions?
>> >
>> > Thanks!
>> >
>> > -Yoolhee
>> >
>> > 
>> --
>> > Check out the vibrant tech community on one of the world's most
>> > engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> > ___
>> > Rdkit-discuss mailing list
>> > Rdkit-discuss@lists.sourceforge.net
>> > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>> >
>>
>
>
>
> --
> Yoolhee Kim
> M.S. in Energy Science Technology and Policy
> Carnegie Mellon University
> (714) 326-5820 | yoolh...@andrew.cmu.edu
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] edge matrix

2018-01-17 Thread Chris Earnshaw
I don't think there's a way to do this using RDKit itself, but it appears
to be straightforward using Python with numpy and networkx, e.g.

import numpy as np
import networkx as nx
a = np.matrix([[0, 1, 0, 0, 0],[1, 0, 1, 1, 0],[0, 1, 0, 0, 0],[0, 1, 0, 0,
1],[0, 0, 0, 1, 0]])
b = nx.from_numpy_matrix(a)
lg = nx.line_graph(b)
ea = nx.adjacency_matrix(lg)
ea

matrix([[ 0.,  1.,  1.,  0.],
[ 1.,  0.,  1.,  0.],
[ 1.,  1.,  0.,  1.],
[ 0.,  0.,  1.,  0.]])

Hope this helps - but I'm way out of my depth here!

Best regards,
Chris


On 17 January 2018 at 16:57, Mario Lovrić  wrote:

> Correct, I am looking for a rdkit-hidden-option to do it :D
>
> On Wed, Jan 17, 2018 at 5:56 PM, Jason Biggs 
> wrote:
>
>> I am a novice when it comes to graph theory, but it seems like what is
>> wanted here is the adjacency matrix of the corresponding line graph (
>> http://mathworld.wolfram.com/LineGraph.html).
>>
>> I don't know how to do this in python, but if I use mathematica, it goes
>> like this
>>
>> adjacencyMatrix = {{0, 1, 0, 0, 0}, {1, 0, 1, 1, 0}, {0, 1, 0, 0,
>> 0}, {0, 1, 0, 0, 1}, {0, 0, 0, 1, 0}};
>>
>> graph = AdjacencyGraph[adjacencyMatrix];
>> lineGraph = LineGraph[graph];
>> AdjacencyMatrix[lineGraph] // MatrixForm
>>
>> [image: Inline image 1]
>>
>>
>> Jason Biggs
>>
>>
>> On Wed, Jan 17, 2018 at 10:21 AM, Marta Stępniewska-Dziubińska via
>> Rdkit-discuss  wrote:
>>
>>> Hi Mario,
>>>
>>> What exactly do you mean by 'edge matrix'? Are you sure you provided a
>>> correct example? If you want to get an adjacency matrix of a molecular
>>> graph you can iterate over bonds to get it:
>>>
>>> from rdkit.Chem import MolFromSmiles
>>> import numpy as np
>>> m = MolFromSmiles('CC(C)CC')
>>> n = m.GetNumAtoms()
>>> E = np.zeros((n, n))
>>> for b in m.GetBonds():
>>> i = b.GetBeginAtomIdx()
>>> j = b.GetEndAtomIdx()
>>> E[[i,j], [j,i]] = 1
>>>
>>>
>>> Hope this helps,
>>> Marta SD
>>>
>>>
>>>
>>> 2018-01-17 16:31 GMT+01:00 Mario Lovrić :
>>>
 Dear all,

 Does any one have an idea how to get an edge matrix (graph theory) out
 of Rdkit, I digged deep but didnt find anything.

 F.example for:

 'CC(C)CC'


 it would be:

 array([[0, 1, 1, 0],
[1, 0, 1, 0],
[1, 1, 0, 1],
[0, 0, 1, 0]])

 Thanks.


 --
 Mario Lovrić

 
 --
 Check out the vibrant tech community on one of the world's most
 engaging tech sites, Slashdot.org! http://sdm.link/slashdot
 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


>>>
>>> 
>>> --
>>> Check out the vibrant tech community on one of the world's most
>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>> ___
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>>
>>
>
>
> --
> Mario Lovrić
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] changing atomic charges with ReactionFromSmarts

2018-01-25 Thread Chris Earnshaw
Hi Jan

Your code doesn't change the charges because the reaction SMARTS doesn't
tell it to. If you say -

rxn_smarts = ['[N+:1]=[*:2]-[O-:3]>>[N+0:1]-[*:2]=[O+0:3]']

- the charges in the product are explicitly defined and you should get the
result you expect.

Best regards,
Chris

On 25 January 2018 at 10:18, Jan Halborg Jensen  wrote:

>
> The following code changes the bond order correctly but does not change
> the charges accordingly
>
> Any idea what I am doing wrong?
>
> Thanks, Jan
>
>
> def clean_charges(mol):
> rxn_smarts = ['[N+:1]=[*:2]-[O-:3]>>[N:1]-[*:2]=[O:3]']
>
> for smarts in rxn_smarts:
> rxn = AllChem.ReactionFromSmarts(smarts)
> ps = rxn.RunReactants((mol,))
> for x in ps:
> mol = x[0]
>
> #rdmolops.SanitizeMol(mol)
> return mol
>
> mol = Chem.MolFromSmiles("C[NH+]=C(C)[O-]")
>
> mol = clean_charges(mol)
> print Chem.MolToSmiles(mol)
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] remove salts + neutralize

2018-03-14 Thread Chris Earnshaw
The minuses are right. These are the single bonds between the individual
aromatic rings and this representation is strictly correct. The OpenBabel
representation doesn't mark these bonds as explicitly single and, as
they're between two aromatic atoms, the bond type could be inferred to be
aromatic. In principle this could cause problems elsewhere.

In short, RDKit is explicit and correct, OpenBabel is probably OK but
relies on any software reading its SMILES interpreting the ambiguity
correctly.

Best regards,
Chris

Chris

On 14 March 2018 at 15:40, Mario Lovrić  wrote:

> Dear all,
>
>
> I was trying to remove salts + neutralize several structures:
>
> O.[K+].NC(=O)C1=NC(=N[N-]1)C1=CC=CC(=C1)C1=CC(F)=CC=C1OCC(F)(F)C(F)(F)F
>
> -->remove salts-->
>
> NC(=O)c1nc(-c2(-c3cc(F)ccc3OCC(F)(F)C(F)(F)F)c2)n[n-]1
>
> --> neutralize-->
>
> NC(=O)c1nc(-c2(-c3cc(F)ccc3OCC(F)(F)C(F)(F)F)c2)n[nH]1
>
>
> The minuses appear front of the "c"s.
> I guess it is considered as bonds everywhere.
>
> When canonicalizing them in RDKit they dont change, but OpenBabel does for
> example.
>
> Fc1ccc(c(c1)c1(c1)c1n[nH]c(n1)C(=O)N)OCC(C(F)(F)F)(F)F
>
>
> Might it cause problems anywhere?
>
>
> KR
> --
> Mario Lovrić
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] generate conformes with a restrained core

2018-03-24 Thread Chris Earnshaw
Hi Felipe

You're doing something similar to the problem Paolo addressed.

ConstrainedEmbed (see
http://www.rdkit.org/Python_Docs/rdkit.Chem.AllChem-module.html#ConstrainedEmbed)
requires a mol object as the first parameter, but you are passing it an
integer cid value, not a molecule. Your code must retrieve the molecule
associated with the cid number and pass that to ConstrainedEmbed instead of
the cid number.

Regards,
Chris

On 23 March 2018 at 21:58, Felipe Trajtenberg  wrote:

> Hi Luan!
> thanks!
>
> I was trying to use it but I am still struggling with very basic problems.
> What you are saying is that I can delete the UFFOptimizeMolecule and add a
> loop with ConstrainedEmbed to minimize with constraints?
>
> in my script, using:
>
> for cid in cids:
> AllChem.ConstrainedEmbed(cid,coreMol,useTethers=True)
>
> I get the error:
>
> Traceback (most recent call last):
>   File "", line 2, in 
>   File "/export/home/felipet/.conda/envs/my-rdkit-env/lib/python2.
> 7/site-packages/rdkit/Chem/AllChem.py", line 274, in ConstrainedEmbed
> match = mol.GetSubstructMatch(core)
> AttributeError: 'int' object has no attribute 'GetSubstructMatch'
>
> what I am missing?
>
> thanks!
>
> 2018-03-23 16:19 GMT-03:00 Luan Carvalho Martins <
> luancarvalhomart...@gmail.com>:
>
>> When you used AllChem.UFFOptimizeMolecule(newMol3D,confId=cid) the
>> minimization proceeded without constraints, therefore, the core embedding
>> was lost. Read the source of ConstrainedEmbed [
>> http://www.rdkit.org/Python_Docs/rdkit.Chem.AllChem-pysrc.h
>> tml#ConstrainedEmbed]. This function does a restricted minimization
>> using AddDistanceConstraint.
>>
>> Sincerely,
>> Luan Carvalho.
>>
>> Atenciosamente, Luan Carvalho Martins
>> luancarvalhomart...@gmail.com
>>
>> On Fri, Mar 23, 2018 at 3:56 PM, Felipe Trajtenberg > > wrote:
>>
>>> Hi Paolo
>>>
>>> great! it was a very simple thing. Now the sdf file with the conformers
>>> is generated but the conformers were not constraints at the core of the
>>> ligand...as I was trying? can you tell me why?
>>>
>>> thanks a lot!
>>>
>>> felipet
>>>
>>> 2018-03-23 15:37 GMT-03:00 Paolo Tosco :
>>>
 Dear Felipe,

 cids is a list of conformer ids, i.e. integer numbers. Therefore

 prbMol = cids[prbNum]

 sets prbMol to the integer value of the prbNum element of the cids list.

 The reason of the error message you are getting:
 Boost.Python.ArgumentError: Python argument types in
 SDWriter.write(SDWriter, int)
 did not match C++ signature:
 write(RDKit::SDWriter {lvalue} self, RDKit::ROMol {lvalue} mol, int
 confId=-1)

 is that you are passing only an int to SDWriter.write(), rather than a
 mol and an int as the function expects.
 What you need is:

 nMol = len(cids)
 w = Chem.SDWriter('conf_output.sdf')

 for prbNum in range(0, nMol):
 prbMol = cids[prbNum]
 w.write(newMol3D, prbMol)
 w.close()

 or, more simply:

 w = Chem.SDWriter('conf_output.sdf')

 for cid in cids:
 w.write(newMol3D, cid)
 w.close()

 Cheers,
 p.

 On 03/23/18 18:04, Felipe Trajtenberg wrote:

 Dear all,

 sorry but I am really new at using RDkit. By looking at the scripts and
 tutorial available I wrote the following script. The idea is to generate a
 number of conformers for a big and flexible ligand, but with constraints.
 This script generate a set of conformers but I can't write a SDF file with
 all of them. The error I get is:

 Traceback (most recent call last):
   File "", line 3, in 
 Boost.Python.ArgumentError: Python argument types in
 SDWriter.write(SDWriter, int)
 did not match C++ signature:
 write(RDKit::SDWriter {lvalue} self, RDKit::ROMol {lvalue} mol, int
 confId=-1)

 Thanks in advance for any help

 felipet

 The script is:

 from rdkit import Chem
 from rdkit.Chem import AllChem
 import os

 mols = [x for x in Chem.SDMolSupplier('out.CRC.sdf',removeHs=False) if
 x is not None]
 core = Chem.MolFromSmarts('CCC(O)=O')


 em = Chem.EditableMol(mols[0])
 match = mols[0].GetSubstructMatch(core)
 for idx in range(mols[0].GetNumAtoms()-1,-1,-1):
 if idx not in match:
 em.RemoveAtom(idx)

 coreMol = em.GetMol()
 Chem.SanitizeMol(coreMol)

 newMol = Chem.MolFromSmiles('CC(O)=O')

 newMol=Chem.AddHs(newMol)
 newMol3D=AllChem.ConstrainedEmbed(newMol,coreMol)

 cids=AllChem.EmbedMultipleConfs(newMol3D,numConfs=100,pruneR
 msThresh=1.0,enforceChirality=True)
 for cid in cids: AllChem.UFFOptimizeMolecule(newMol3D,confId=cid)

 nMol = len(cids)
 w = Chem.SDWriter('conf_output.sdf')

 for prbNum in range(0, nMol):
 prbMol = cids[prbNum]
 w.write(prbMol)
 w.c

Re: [Rdkit-discuss] MolFromMol2Block changes carboxylic group representation

2018-03-28 Thread Chris Earnshaw
Hi Maria

I would say that the behaviour of RDKit with your MOL2 file is right.
SMILES notation doesn't have a way to represent the delocalised form of a
carboxylate anion, so the O=C[O-] form is the correct SMILES for this
structure. RDKit  does a good job in recognising that it's the anion based
on the fractional charges in the MOL2 file (openBabel would give the
neutral carboxylic acid as output).

Using aromatic atoms/bonds to represent delocalisation generally is not
possible in SMILES. It's also not strictly correct to use aromatic bonds
(type 4) in an SD or MOL file either unless the structure is a query - the
representation of a 'real' molecule in this file format should also be in
the charge-separated rather than delocalised form.

Sorry that this isn't particularly helpful for your case, but it is correct
behaviour.

Best regards,
Chris

On 28 March 2018 at 10:53, Maria Matveyeva  wrote:

> Sorry, i forgot to ask a question:  how can i retain the initial form?
> Maria
>
> 2018-03-28 11:46 GMT+02:00 Maria Matveyeva :
>
>> Hello all,
>> When i read mol2 format from either mol2 file or block, it changes
>> represenation of carboxylic group from tripos  aromatic representaiton with
>> -0.5 charges on oxygens to representaion with one single and one double
>> bond (when same representation read from sdf/mol  it retains "aromatic"
>> form):
>>
>>
>>
>> from rdkit import Chem
>>
>> mol = Chem.MolFromMol2Block('@MOLECULE\n\n7 6
>> 1\nSMALL\nUSER_CHARGES\n@ATOM\n1\tC183.123548.4843
>> -1.9335\tC.3\t\t1\tnoname\t0.\n2\tC284.405549.1718
>> -2.1563\tC.2\t\t1\tnoname\t0.\n3\tO185.578348.5739
>> -1.8454\tO.co2\t1\tnoname -0.5000\n4\tO284.432750.3304
>> -2.6301\tO.co2\t1\tnoname -0.5000\n5\tH182.711747.9177
>>  0.7151\tH\t\t1\tnoname\t0.\n6\tH282.615348.8969
>> -1.0496\tH\t\t1\tnoname\t0.\n7\tH382.472748.5910
>> -2.8139\tH\t\t1\tnoname\t0.\n@BOND\n1\t1\t2\t1\
>> n2\t2\t3\tar\n3\t2\t4\tar\n4\t1\t6\t1\n5\t1\t7\t1\n6\t5\t1\t1\n@
>> SUBSTRUCTURE\n1\tnoname\t1\n')
>>
>> Chem.MolToSmiles(mol)
>>
>> 'CC(=O)[O-]'
>>
>> Thanks in advance,
>> Maria
>>
>
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] elimination of small fragments

2018-06-29 Thread Chris Earnshaw
I'd say that using RDkit to calculate the numbers of heavy atoms is
significantly more robust than a purely lexical approach - and it's easy to
implement.

It's also dangerous to just discard the smallest fragment. Years ago I
worked on a project where the active molecule had only 11 heavy atoms and
the counterion (dicyclohexylamine) had 13 - so relying on atom counts is a
way to sometimes throw the baby out with the bath water. It's much safer
(but also a lot more work) to build a desalter/desolvater that explicitly
removes just the fragments you really want to remove.

Best regards,
Chris

On 29 June 2018 at 09:56, Ed Griffen  wrote:

> Using the string length to find the number of atoms in a molecule is OK -
> but you need to take account of the additional characters in SMILES that
> are not just atoms, for example:
>
> two letter elements - like silicon, chlorine etc
> brackets , ring closures, charges, explicit hydrogens
>
> It’s simple to do:
>
> Here’s a worked example:
>
> >>> SMILES = 'C[S@@+]([O-])c1ccc(cc1)[Si](C)(C)C'
> >>> print(len(SMILES))
> 34
> >>> heavies = [char for char in SMILES if char not in
> '''()[]1234567890#:;,.?%-=+\/Hherlabdgfikmputvy@''']
> >>> print(len(heavies))
> 13
>
> obviously you do this after splitting on the .
>
> Best regards,
>
> Ed
>
> Dr Ed Griffen,
> Technical Director,
> mobile +44 7762 121593
> office +44 1625 238843
> ed.grif...@medchemica.com
> www.medchemica.com
> skype: ed.griffen
> Twitter: @MedChemica
> Medchemica Ltd is a company registered in England and Wales with company
> number 8162245.
>
> Confidentiality Notice: This message is private and may contain
> confidential, proprietary and legally privileged information. If you have
> received this message in error, please notify us and remove it from your
> system and note that you must not copy, distribute or take any action in
> reliance on it. Any unauthorised use or disclosure of the contents of
> this message is not permitted and may be unlawful.
> Disclaimer: Email messages may be subject to delays, interception,
> non-delivery and unauthorised alterations. Therefore, information expressed
> in this message is not given or endorsed by MedChemica Limited unless
> otherwise notified by an authorised representative independent of this
> message. No contractual relationship is created by this message by any
> person unless specifically indicated by agreement in writing other than
> email.
> Monitoring: MedChemica Limited retains and monitors all email traffic data
> and content for the purposes of the prevention and detection of
> crime, ensuring the security of our computer systems and checking
> compliance with our policies.
>
> On 29 Jun 2018, at 06:37, Alfredo Quevedo  wrote:
>
> thank you Hideyoshi for your feedback.
> regards
> Alfredo
>
> Enviado desde BlueMail 
> En 28 de junio de 2018, en 21:43, "藤秀義" 
> escribió:
>>
>> Dear Alfredo,
>>
>> Although not strictly based on the number of atoms, but on the length of
>> SMILES string, the simplest way is using Python built-in functions as
>> follows:
>>
>> smiles = 'CCC.CC'
>> fragment = max(smiles.split('.'), key=len)
>> print (fragment)
>>
>> Best regards,
>>
>> Hideyoshi
>>
>>
>> thank you Paolo for this help, I will study the code and try it,
>>>
>>> best regards
>>>
>>> Alfredo
>>>
>>> Enviado desde BlueMail 
>>>
>>> En 28 de junio de 2018, en 17:08, Paolo Tosco <
>>> paolo.tosco.m...@gmail.com> escribió:
>>>
>>> Dear Alfredo,
>>>
>>> if you wish to keep only the largest disconnected fragment you may try
>>> the following:
>>>
>>> mols = list(rdmolops.GetMolFrags(mol, asMols = True))
>>> if (mols):
>>>  mols.sort(reverse = True, key = lambda m: m.GetNumAtoms())
>>>  mol = mols[0]
>>>
>>> Hope that helps, cheers
>>> p.
>>>
>>> On 06/28/18 19:38, Alfredo Quevedo wrote:
>>>
>>>  Good afternoon,
>>>
>>>  I would like to filter out small fragments from a list of molecules
>>>  using the below strategy:
>>>
>>>  from rdkit import Chem
>>>  from rdkit.Chem import AllChem
>>>  from rdkit.Chem import SaltRemover fragment
>>>
>>>  remover=SaltRemover.SaltRemover()
>>>  mol=Chem.MolFromSmiles('CCC.CC')
>>>  res=remover.StripMol(mol)
>>>  print(res.GetNumAtoms())
>>>
>>>
>>>  I am getting 5 atoms as output, so the ´CC´ is not being stripped (the
>>>  script workd ok for salts). Is there any way of filtering non salts
>>>  small fragments?
>>>
>>>  thank you very much in advance,
>>>
>>>  regards,
>>>
>>>  Alfredo
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>>
>>>
>>>  Check out the vibrant tech community on one of the world's most
>>>  engaging tech sites, Slashdot.org ! 
>>> http://sdm.link/slashdot
>>>
>>> --
>>>
>>>
>>>  Rdkit-discuss mailing list
>>>  Rdkit-discuss@lists.sourceforge.net
>>>  https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>>
>>>
>>> ---

Re: [Rdkit-discuss] elimination of small fragments

2018-06-29 Thread Chris Earnshaw
Ed,  As always, there are no 'one size fits all' solutions, it all depends
on what you need to do. I was processing tens of millions of screening
compounds into a database and used a desalter/desolvator written using the
RDkit C++ API. That was quite quick enough for my needs - I never tried it
with Python so I don't know what sort of performance difference there would
be.

Your pyrrolidinium tosylate example is a good one and again 'it all
depends'. In my case it would be removed completely and I'd be happy with
that. For other purposes it may not be so obvious - maybe you'd want to
keep both fragments in some cases...

Chris

On 29 June 2018 at 12:09, Ed Griffen  wrote:

> Chris, Absolutely agree with your points - processing the molecules into
> RDkit is much more robust, but it depends though on how many you’ve got to
> process.  If you’re doing millions to billions, then the overhead can
> become a problem and doing it in two steps (lexical then graph) can be the
> pragmatic solution.
>
> Desalting by removing the smallest fragment performs as expected -
> pyrollidinium tosylate - which part of the salt do you want to discard? If
> you don’t know it’s hard to create a heuristic.
>
> C1CC[NH2+]C1.Cc1ccc(cc1)S([O-])(=O)=O
>
> Ed
>
>
> Dr Ed Griffen,
> Technical Director,
> mobile +44 7762 121593
> office +44 1625 238843
> ed.grif...@medchemica.com
> www.medchemica.com
> skype: ed.griffen
> Twitter: @MedChemica
> Medchemica Ltd is a company registered in England and Wales with company
> number 8162245.
>
> Confidentiality Notice: This message is private and may contain
> confidential, proprietary and legally privileged information. If you have
> received this message in error, please notify us and remove it from your
> system and note that you must not copy, distribute or take any action in
> reliance on it. Any unauthorised use or disclosure of the contents of
> this message is not permitted and may be unlawful.
> Disclaimer: Email messages may be subject to delays, interception,
> non-delivery and unauthorised alterations. Therefore, information expressed
> in this message is not given or endorsed by MedChemica Limited unless
> otherwise notified by an authorised representative independent of this
> message. No contractual relationship is created by this message by any
> person unless specifically indicated by agreement in writing other than
> email.
> Monitoring: MedChemica Limited retains and monitors all email traffic data
> and content for the purposes of the prevention and detection of
> crime, ensuring the security of our computer systems and checking
> compliance with our policies.
>
> On 29 Jun 2018, at 11:59, Chris Earnshaw  wrote:
>
> I'd say that using RDkit to calculate the numbers of heavy atoms is
> significantly more robust than a purely lexical approach - and it's easy to
> implement.
>
> It's also dangerous to just discard the smallest fragment. Years ago I
> worked on a project where the active molecule had only 11 heavy atoms and
> the counterion (dicyclohexylamine) had 13 - so relying on atom counts is a
> way to sometimes throw the baby out with the bath water. It's much safer
> (but also a lot more work) to build a desalter/desolvater that explicitly
> removes just the fragments you really want to remove.
>
> Best regards,
> Chris
>
> On 29 June 2018 at 09:56, Ed Griffen  wrote:
>
>> Using the string length to find the number of atoms in a molecule is OK -
>> but you need to take account of the additional characters in SMILES that
>> are not just atoms, for example:
>>
>> two letter elements - like silicon, chlorine etc
>> brackets , ring closures, charges, explicit hydrogens
>>
>> It’s simple to do:
>>
>> Here’s a worked example:
>>
>> >>> SMILES = 'C[S@@+]([O-])c1ccc(cc1)[Si](C)(C)C'
>> >>> print(len(SMILES))
>> 34
>> >>> heavies = [char for char in SMILES if char not in
>> '''()[]1234567890#:;,.?%-=+\/Hherlabdgfikmputvy@''']
>> >>> print(len(heavies))
>> 13
>>
>> obviously you do this after splitting on the .
>>
>> Best regards,
>>
>> Ed
>>
>> Dr Ed Griffen,
>> Technical Director,
>> mobile +44 7762 121593
>> office +44 1625 238843
>> ed.grif...@medchemica.com
>> www.medchemica.com
>> skype: ed.griffen
>> Twitter: @MedChemica
>> Medchemica Ltd is a company registered in England and Wales with company
>> number 8162245.
>>
>> Confidentiality Notice: This message is private and may contain
>> confidential, proprietary and legally privileged in

Re: [Rdkit-discuss] Error if run Draw in Python

2018-07-11 Thread Chris Earnshaw
Hi

I'm no Python expert, but I think the problem is that Python doesn't (by
default) do filename globbing. As a result it doesn't understand the
significance of the ~ character in your directory path and tries to
interpret it literally. The simple solution is to just give a path that can
be interpreted directly (e.g.
/home/your_username/Desktop/rest_of_the_path), alternatively install the
Python glob module to parse your current path. The former seems to be a
much simpler solution!

Best regards,
Chris

On 11 July 2018 at 00:28, Phuong Chau  wrote:

> Thank you so much!
>
> On Tue, Jul 10, 2018 at 4:33 PM, Malitha Kabir 
> wrote:
>
>> Hi,
>>
>> Probably the designated folder is missing. So, rdkit cannot create the
>> file in specified path. Thanks.
>>
>> - malitha
>>
>> On Wed, Jul 11, 2018, 5:46 AM Phuong Chau  wrote:
>>
>>> Hello,
>>>
>>> I was trying to draw the 2D structure of a molecule inside a python
>>> script (.py). It works with other functions such as MolToSmiles,
>>> FingerPrintMol(),... but somehow the Draw.MolToSmile() function does not
>>> work. It kept throw errors of:
>>>  File "PairsFinder.py", line 53, in 
>>> Draw.MolToFile(chem,'~/Desktop/work/2018/July/10/500_v2/
>>> Chem1_avg7.png')
>>>   File "/usr/lib/python2.7/dist-packages/rdkit/Chem/Draw/__init__.py",
>>> line 182, in MolToFile
>>> canvas.save()
>>>   File "/usr/lib/python2.7/dist-packages/rdkit/Chem/Draw/spingCanvas.py",
>>> line 111, in save
>>> self.canvas.save()
>>>   File "/usr/lib/python2.7/dist-packages/rdkit/sping/PIL/pidPIL.py",
>>> line 166, in save
>>> self._image.save(filename, format=format)
>>>   File "/usr/lib/python2.7/dist-packages/PIL/Image.py", line 1672, in
>>> save
>>> fp = builtins.open(filename, "wb")
>>> IOError: [Errno 2] No such file or directory:
>>> '~/Desktop/work/2018/July/10/500_v2/Chem1_avg7.png'
>>> My script is :
>>> chem = Chem.MolFromSmiles('Nc1c1F')
>>> Draw.MolToFile(chem,'~/Desktop/work/2018/July/10/500_v2/Chem1_avg7.png')
>>>
>>> Would anyone help me please? How do I run Draw in python script?
>>>
>>> Thank you so much for your help
>>> --
>>> Phuong Chau
>>> Smith College '20
>>> Engineering Major
>>> 
>>> --
>>> Check out the vibrant tech community on one of the world's most
>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot__
>>> _
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>
>
>
> --
> Phuong Chau
> Smith College '20
> Engineering Major
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Problem getting valence

2018-07-26 Thread Chris Earnshaw
Hi

It looks to me like N5 [nH:5] also has a problem. This has 3 connections to
heavy atoms, is specified to have a hydrogen attached, but has no charge.
This may not have triggered an error but it looks wrong, especially in this
structure. Surely this atom should just be [n:5] ?

Best regards,
Chris

On 26 July 2018 at 10:42, Paolo Tosco  wrote:

> Dear Lewis,
>
> C #7 indeed has 5 valences:
>
>  *
>  |
> -C=O
>  |
>  H
>
>
> If you change [CH:14] into [C:14] sanitization will succeed.
>
> Cheers,
> p.
>
> On 07/26/18 09:42, Lewis J Martin wrote:
>
> Hi all,
>
> I have generated a molecule that I can both visualize and export as
> SMILES, but I can't Sanitize it. Creating a new molecule from that SMILES
> fails due to something to do with the valence. It seems to me that it has
> the correct amount of bonds.
>
> Code to reproduce and 'debugging' that Ive tried:
> --
> #sanitize=False or it won't load this SMILES
> mol = Chem.MolFromSmiles('[CH3:0][CH2:1][CH2:2][CH2:3][CH2:4][
> nH:5]1[c:6]([CH:14](=[O:15])[*:16])[n:7][c:8]2[cH:9][cH:10][
> cH:11][cH:12][c:13]12',sanitize=False)
>
> #print atom indices and atomic number:
> for i in mol.GetAtoms():
> print(i, i.GetIdx(), i.GetAtomicNum())
>
> #display bonds relating to index 7. Seems correct for a carbon.
> for i in mol.GetBonds():
> if i.GetBeginAtomIdx()==7 or i.GetEndAtomIdx()==7:
> print(i.GetBeginAtomIdx(), i.GetEndAtomIdx(), i.GetBondType())
>
> #print valences. Fails on index 7.
> for i in mol.GetAtoms():
> print(i.GetIdx(), i.GetExplicitValence())
> --
>
>
> Can anyone please offer some advice as to what the problem is?
> Much appreciated!
>
> Lewis
>
>
> PS. here is the output I get:
>
>  0 6
>  1 6
>  2 6
>  3 6
>  4 6
>  5 7
>  6 6
>  7 6
>  8 8
>  9 0
>  10 7
>  11 6
>  12 6
>  13 6
>  14 6
>  15 6
>  16 6
> 6 7 SINGLE
> 7 8 DOUBLE
> 7 9 SINGLE
> 0 4
> 1 4
> 2 4
> 3 4
> 4 4
> 5 3
> 6 4
>
> ---RuntimeError
>   Traceback (most recent call 
> last) in ()  7   8 for i in 
> mol.GetAtoms():> 9 print(i.GetIdx(), i.GetExplicitValence()) 10   
>   #valence = oneMol.GetAtomWithIdx(7).GetExplicitValence() 11 
> #print(valence)
> RuntimeError: Pre-condition Violation
>   getExplicitValence() called without call to calcExplicitValence()
>   Violation occurred on line 162 in file Code/GraphMol/Atom.cpp
>   Failed Expression: d_explicitValence > -1
>   RDKIT: 2018.03.3
>   BOOST: 1_65_1
>
>
>
>
>
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>
>
>
> ___
> Rdkit-discuss mailing 
> listRdkit-discuss@lists.sourceforge.nethttps://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] enumeration of smiles question

2018-08-06 Thread Chris Earnshaw
Hi

The question 'what do you mean by ALL?' springs to mind. None of the
discussion includes dot-disconnected SMILES, which are also perfectly valid
representations. For example C(C1C2)C.C12 is yet another SMILES (of many
possible) for the example structure.

I've no idea whether this is of any relevance to you, but you should
probably consider these representations and decide whether they are
important or not.

Best regards,
Chris

On 6 August 2018 at 11:27, Jan Halborg Jensen  wrote:

> This blogpost links to two other ones that may have done that (haven’t
> read them carefully): https://baoilleach.blogspot.com/2018/06/
> cheminformatics-for-deep-learners.html
>
> Best regards, Jan
>
> On 06 Aug 2018, at 11:57, Guillaume GODIN 
> wrote:
>
> Dear Greg,
>
> Fantastic, thank you to give both explanation and solution to this “simple
> question”, I know this is not so simple & it’s fundamental for data
> augmentation in deep learning.
>
> If I may, I have another question related, do you know if someone has
> worked on a generator of all unique smiles independently of RDKit ?
>
> Thanks again,
>
> Guillaume
>
> *De : *Greg Landrum 
> *Date : *lundi, 6 août 2018 à 11:40
> *À : *Guillaume GODIN 
> *Cc : *RDKit Discuss 
> *Objet : *Re: [Rdkit-discuss] enumeration of smiles question
>
>
> On Thu, Aug 2, 2018 at 8:59 AM Guillaume GODIN <
> guillaume.go...@firmenich.com> wrote:
>
>
> I have a simple question about generating all possible smiles of a given
> molecule:
>
>
> It's a simple question, but the answer is somewhat complicated. :-)
>
>
>
> RDKit provides only 4 differents smiles for my molecule “CCC1CC1“:
> C1C(CC)C1
> CCC1CC1
> C1(CC)CC1
> C(C)C1CC1
>
> While by hand we can write those 7 smiles:
> CCC1CC1
> C(C)C1CC1
> C(C1CC1)C
> C1CC(CC)1
> C1C(CC)C1
> C1CC1CC
> C(CC)1CC1
>
> I use this function for the enumeration:
>
> def allsmiles(smil):
> m = Chem.MolFromSmiles(smil) # Construct a molecule from a SMILES
> string.
> if m is None:
> return smil
> N = m.GetNumAtoms()
> if N==0:
> return smil
> try:
> n= np.random.randint(0,high=N)
> t= Chem.MolToSmiles(m, rootedAtAtom=n, canonical=False)
> except :
> return smil
> return t
>
> n= 50
> SMILES = [“CCC1CC1”]
> SMILES_mult = [allsmiles(S) for S in SMILES for i in range(n)]
>
> Why we cannot generate all the 7 smiles ?
>
>
> The RDKit has rules that it uses to decide which atom to branch to when
> generating a SMILES. These are used regardless of whether you are
> generating canonical SMILES or not.
> The upshot of this is that it will never generate a SMILES where there's a
> branch before a ring closure.
> The other important factor here is that atom rank is determined by the
> index of the atom in the molecule when you aren't using canonicalization.
> So changing the atom order on input can help:
>
> In [12]: set(allsmiles('CCC1CC1') for i in range(50))
> Out[12]: {'C(C)C1CC1', 'C1(CC)CC1', 'C1C(CC)C1', 'CCC1CC1'}
>
> In [13]: set(allsmiles('C1CC1CC') for i in range(50))
> Out[13]: {'C(C1CC1)C', 'C1(CC)CC1', 'C1CC1CC', 'CCC1CC1'}
>
> You can do this all at once as follows:
>
> ```
> In [20]: def allsmiles(smil):
> ...: m = Chem.MolFromSmiles(smil) # Construct a molecule from a
> SMILES string.
> ...: if m is None:
> ...: return smil
> ...: N = m.GetNumAtoms()
> ...: if N==0:
> ...: return smil
> ...: aids = list(range(N))
> ...: random.shuffle(aids)
> ...: m = Chem.RenumberAtoms(m,aids)
> ...: try:
> ...: n= random.randint(0,N-1)
> ...: t= Chem.MolToSmiles(m, rootedAtAtom=n, canonical=False)
> ...: except :
> ...: return smil
> ...: return t
> ...:
> ...:
> ...:
>
> In [21]:
>
> In [21]: set(allsmiles('C1CC1CC') for i in range(50))
> Out[21]: {'C(C)C1CC1', 'C(C1CC1)C', 'C1(CC)CC1', 'C1C(CC)C1', 'C1CC1CC',
> 'CCC1CC1'}
> ```
> Note that I switched to using python's built in random module instead of
> using the one in numpy.
>
> -greg
>
>
>
>
>
> Thanks guys,
>
> Best regards,
>
> Guillaume
> 
> ***
> DISCLAIMER
> This email and any files transmitted with it, including replies and
> forwarded copies (which may contain alterations) subsequently transmitted
> from Firmenich, are confidential and solely for the use of the intended
> recipient. The contents do not represent the opinion of Firmenich except to
> the extent that it relates to their official business.
> 
> ***
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org ! http://sd
> m.link/slashdot___
> Rdkit-discuss mailing list
> Rdkit

Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-21 Thread Chris Earnshaw
Hi

I'm afraid I can't help with an RDkit solution to your question, but there
are a couple of issues which should be born in mind:
1) The centroid of a cluster is a vector mean of the fingerprints of all
the members of the cluster and probably will not be represented *exactly*
by any member of the cluster; in this case no structures will have a
distance of 0.0 from the centroid. Do you want to calculate the distances
from the true centroid or from the structure(s) closest to the centroid?
2) The Tanimoto metric doesn't obey the triangle inequality and is
therefore sub-optimal for this kind of analysis. It's better to use an
alternative which does obey the triangle inequality - e.g. the Cosine
metric.

Regards,
Chris Earnshaw


On Thu, 20 Sep 2018 at 21:55, James T. Metz via Rdkit-discuss <
rdkit-discuss@lists.sourceforge.net> wrote:

> RDkit Discussion Group,
>
> I note that RDkit can perform Butina clustering.  Given an SDF of
> small molecules I would like to cluster the ligands, but obtain additional
> information from the clustering algorithm.  In particular, I would like to
> obtain
> the cluster number and Tanimoto distance from the centroid for every ligand
> in the SDF.  The centroid would obviously have a distance of 0.00.
>
> Has anyone written additional RDkit code to extract this additional
> information?
> Thank you.
>
> Regards,
> Jim Metz
>
>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-27 Thread Chris Earnshaw
Hi

There's a degree of confusion here depending on whether people are
considering *similarity* (tanimoto, cosine, whatever) or *distance*
(however that's defined). The two are very different from each other.
Cosine *similarity* obeys the triangle inequality; cosine *distance*
doesn't - pretty much by definition. Tanimoto *similarity* does not obey
the inequality, but Tanimoto *distance* may - depending on how you define
it. It's very important not to get confused about whether it's similarity
or distance that you're considering.

Years ago I worked on some software for fingerprint-based similarity
searching using a variety of metrics, and also for clustering. This was
based entirely on *similarity* calculations, and clustering used the cosine
metric for this reason. BTW, based on this work I recommend k-means
relocation clustering, but it needs carefully optimised code to make it run
fast enough to be useful.

Regards,
Chris Earnshaw

On Thu, 27 Sep 2018 at 02:36, Francois Berenger  wrote:

> On 21/09/2018 16:53, Chris Earnshaw wrote:
> > Hi
> >
> > I'm afraid I can't help with an RDkit solution to your question, but
> > there are a couple of issues which should be born in mind:
> > 1) The centroid of a cluster is a vector mean of the fingerprints of
> > all the members of the cluster and probably will not be represented
> > _exactly_ by any member of the cluster; in this case no structures
> > will have a distance of 0.0 from the centroid. Do you want to
> > calculate the distances from the true centroid or from the
> > structure(s) closest to the centroid?
>
> I have seen 'clustroid' in the literature to mean
> cluster member nearest to the centroid of that cluster.
>
> > 2) The Tanimoto metric doesn't obey the triangle inequality and is
> > therefore sub-optimal for this kind of analysis. It's better to use an
> > alternative which does obey the triangle inequality - e.g. the Cosine
> > metric.
>
> The opposite is true.
>
> Sven Kosub. A note on the triangle inequality for the jaccard distance.
> CoRR, abs/1612.02696, 2016.
>
> Alan H. Lipkus. A proof of the triangle inequality for the tanimoto dis-
> tance. Journal of Mathematical Chemistry, 26(1):263–265, Oct 1999.
>
> While cosine similarity is not a metric, according to wikipedia.
>
> I'm not a mathematician, but I think (1 - Tanimoto) is a proper distance
> as long as the molecules are encoded with only positive values.
> So, Boolean fingerprints are OK, and counted unfolded fingerprints
> as well.
>
> Regard,
> Francois.
>
> > Regards,
> > Chris Earnshaw
> >
> > On Thu, 20 Sep 2018 at 21:55, James T. Metz via Rdkit-discuss
> >  wrote:
> >
> >> RDkit Discussion Group,
> >>
> >> I note that RDkit can perform Butina clustering. Given an SDF
> >> of
> >> small molecules I would like to cluster the ligands, but obtain
> >> additional
> >> information from the clustering algorithm. In particular, I would
> >> like to obtain
> >> the cluster number and Tanimoto distance from the centroid for every
> >> ligand
> >> in the SDF. The centroid would obviously have a distance of 0.00.
> >>
> >> Has anyone written additional RDkit code to extract this
> >> additional information?
> >>
> >> Thank you.
> >>
> >> Regards,
> >>
> >> Jim Metz
> >>
> >> ___
> >> Rdkit-discuss mailing list
> >> Rdkit-discuss@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss [1]
> >
> >
> > Links:
> > --
> > [1] https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
> >
> > ___
> > Rdkit-discuss mailing list
> > Rdkit-discuss@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Fingerprint collision and machine learning

2018-10-10 Thread Chris Earnshaw
Hi

It sounds to me like you're already getting better results than you could
reasonably expect.

Prediction of melting point is a phenomenally difficult thing to do; you're
trying to find the temperature at which a (generally undefined) solid
crystalline phase is in equilibrium with a (probably even less defined)
liquid phase. You also need to consider that the crystalline form of your
solid phase is not necessarily truly constant - what polymorph is involved?
Melting points of alternative polymorphs can be radically different and
this is one of the real bugbears of pharmaceutical and agrochemical
development. If you haven't found the most stable form early in the
development process there can be very nasty surprises downstream.

Expecting to handle all these challenges with a descriptor as simple as a
molecular fingerprint - regardless of bit-length, collisions etc. is
probably over optimistic...

Regards,
Chris Earnshaw

On Wed, 10 Oct 2018 at 13:16, Michal Krompiec 
wrote:

> Hi Thomas,
> Radius 2, 2048 bits, 5200 data points.
>
> On Wed, 10 Oct 2018 at 13:13, Thomas Evangelidis 
> wrote:
>
>> What's your bitvector length and radius? How many training samples do you
>> have?
>>
>> On Wed, 10 Oct 2018 at 13:51, Michal Krompiec 
>> wrote:
>>
>>> Hi all,
>>> I have a slightly off-topic question. I'm trying to train a neural
>>> network on a dataset of small molecules and their melting points. I did get
>>> a not-so-bad accuracy with Morgan fingerprints, but I've realised that
>>> regardless of FP radius and bitvector length, several dozen molecules have
>>> the same fingerprints but wildly different melting points. I am pretty sure
>>> this is a "solved problem" so I don't want to reinvent the wheel. What is
>>> the recommended/usual way of dealing with this?
>>> Thanks,
>>> Michal
>>>
>>>
>>> ___
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>
>>
>> --
>>
>> ==
>>
>> Dr Thomas Evangelidis
>>
>> Research Scientist
>>
>> IOCB - Institute of Organic Chemistry and Biochemistry of the Czech
>> Academy of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>
>> Prague, Czech Republic
>>   &
>> CEITEC - Central European Institute of Technology
>> <https://www.ceitec.eu/>
>> Brno, Czech Republic
>>
>> email: teva...@gmail.com
>>
>> website: https://sites.google.com/site/thomasevangelidishomepage/
>>
>>
>> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Aromaticity question

2018-10-23 Thread Chris Earnshaw
Sorry about this, but I think that 'perhaps sub-optimal' should be replaced
by 'definitely wrong'. The 'quasi-aromatic' system in these two structures
is identical and should behave as such, but in practice one of them matches
a pyridine SMARTS pattern and the other doesn't. That shouldn't be affected
by whether the saturated substituents form a ring or not. I do appreciate
that it gets messy to deal with as fused rings may be fully conjugated, but
the current behaviour seems to be disturbingly inconsistent. It would be
suboptimal to say that no exocyclic bond is allowed to steal electrons, but
that may be better than what's happening here.

Apologies for the dissent!

Chris Earnshaw



On Tue, 23 Oct 2018 at 11:57, Greg Landrum  wrote:

> The current implementation requires "exocyclic" bonds to actually be
> *non-ring* bonds in order to be recognized as such.
> This is perhaps sub-optimal, but it's clearly defined and avoids arguments
> about when exactly an "exocyclic" bond starts stealing electrons.
>
> -greg
>
> On Tue, Oct 23, 2018 at 12:46 PM Francis Atkinson 
> wrote:
>
>> Ian,
>>
>> I make it 6 electrons: two from the N, none from the C double bonded
>> to the exocyclic N, and one each from four other carbons in the ring. It's
>> isoelectronic with *e.g.* pyridone, which is aromatic in RDKit...
>>
>> In [1]: from rdkit import Chem
>>
>> In [2]: Chem.MolToSmiles(Chem.MolFromSmiles('O=c1[nH]1'))
>> Out[2]: 'O=c1[nH]1'
>>
>> The protonated/tautomerised version are indeed aromatic
>> (interconverting bewteen these species was actually how I came across this
>> issue), but I still reckon the unprotonated bicyclic should be aromatic
>> too...
>>
>> Francis
>> On 23/10/2018 11:18, Ian Tickle wrote:
>>
>>
>> Hi, it seems to me that neither is aromatic since the N-substituted
>> hetero ring breaks the Huckel rule by having 7 e- (2 from the N and 1 each
>> from the 5 Cs).  If you remove 1 e- from the N (so it's [n+]) and also make
>> the external double bond into a single (picking up a proton on the other N)
>> it becomes pyridinium which is certainly aromatic.
>>
>> [n+]12c1NCCC2
>>
>> [n+]12c1NC.CC2
>>
>> What does it make of those?
>>
>> Cheers
>>
>> -- Ian
>>
>>
>> On Tue, 23 Oct 2018 at 10:37, Francis Atkinson  wrote:
>>
>>> Hello,
>>>
>>>  In the following pair of molecules, the bicyclic is non-aromatic,
>>> whereas the 'ring-opened' analogue is aromatic...
>>>
>>> In [1]: from rdkit import Chem
>>>
>>> In [2]: Chem.MolToSmiles(Chem.MolFromSmiles('n12c1=NCCC2'))
>>> Out[2]: 'C1=CC2=NCCCN2C=C1'
>>>
>>> In [3]: Chem.MolToSmiles(Chem.MolFromSmiles('n12c1=NC.CC2'))
>>> Out[3]: 'CCn1c1=NC'
>>>
>>> Notebook version:
>>>
>>> https://nbviewer.jupyter.org/gist/flatkinson/b88eb42510a79594a9e37042eeb7e224
>>>
>>> This seems counter-intuitive to me: I don't see why the pyridine in the
>>> first molecule shouldn't be aromatic, just as it is in the second.
>>>
>>> Am I missing something here?
>>>
>>>  Thanks,
>>>
>>>  Francis
>>>
>>> --
>>> Dr Francis L Atkinson
>>>
>>> Chemogenomics Group
>>> European Bioinformatics Institute (EMBL-EBI)
>>> European Molecular Biology Laboratory
>>> Wellcome Genome Campus
>>> Hinxton
>>> Cambridge CB10 1SD
>>> United Kingdom
>>>
>>> (01223) 494473
>>>
>>>
>>>
>>> ___
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>> --
>> Dr Francis L Atkinson
>>
>> Chemogenomics Group
>> European Bioinformatics Institute (EMBL-EBI)
>> European Molecular Biology Laboratory
>> Wellcome Genome Campus
>> Hinxton
>> Cambridge CB10 1SD
>> United Kingdom
>>
>> (01223) 494473
>>
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Aromaticity question

2018-10-23 Thread Chris Earnshaw
Mea culpa - I hit Reply rather than Reply All and so only sent this to
Greg...

On Tue, 23 Oct 2018 at 13:53, Chris Earnshaw  wrote:

> Hi Greg
>
> Apologies again, I'm not trying to stir things up here. As we can see from
> some of the the other discussion there's no clear view of what constitutes
> aromaticity in these cases. I'm of the school which says that pyridone is
> at least somewhat aromatic because, in crude terms, the electronegative
> carbonyl oxygen 'steals' the electron from the carbon, the carbon provides
> an empty p-orbital to the conjugated ring, and the ring nitrogen provides a
> pair of electrons - hence 4n + 2 and aromatic.
>
> However, the thing that really worries me is that the 'iminopyridine' ring
> in n12c1=NCCC2 *should* be perceived in the same way as in
> n12c1=NC.CC2 but in practice that doesn't happen - one matches the
> pyridine SMARTS c1n1 and the other doesn't. This seems to be
> potentially dangerous. The question of 'aromatic or not' is interesting,
> but I'm actually more concerned with the consequences for compound
> searching and filtering.
>
> As an approach, rather than simply checking if the exocyclic bond is in
> another ring (of any type), would it be possible to check if that other
> ring is fully conjugated? If it is, then the Huckel rule could/should be
> applied to the fused system to determine aromaticity. If not, then the fact
> the substituents form a ring is irrelevant and the potential aromatic
> should be treated in the same way as the non-fused analogue. This would
> avoid the current inconsistency, but there would no doubt still be some
> challenging edge cases...
>
> Best regards,
> Chris
>
> On Tue, 23 Oct 2018 at 12:43, Greg Landrum  wrote:
>
>> Dissent is fine, but it's important to remember that there are *always*
>> going to be edge cases and that we're not trying to model something
>> physically observable here. The concept of aromaticity is primarily there
>> to make canonicalization easier. Section 3.4.2 here:
>> http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html has more
>> info about this, as does the RDKit documentation:
>> http://rdkit.org/docs/RDKit_Book.html#aromaticity
>>
>> I'm willing to change the current behavior, but someone would need to
>> explain what it should be changed to in a way that is clear, unambiguous,
>> and that would allow a human being looking at the structure to relatively
>> easily figure out whether or not a given ring is aromatic.
>>
>>
>> On Tue, Oct 23, 2018 at 1:17 PM Chris Earnshaw 
>> wrote:
>>
>>> Sorry about this, but I think that 'perhaps sub-optimal' should be
>>> replaced by 'definitely wrong'. The 'quasi-aromatic' system in these two
>>> structures is identical and should behave as such, but in practice one of
>>> them matches a pyridine SMARTS pattern and the other doesn't. That
>>> shouldn't be affected by whether the saturated substituents form a ring or
>>> not. I do appreciate that it gets messy to deal with as fused rings may be
>>> fully conjugated, but the current behaviour seems to be disturbingly
>>> inconsistent. It would be suboptimal to say that no exocyclic bond is
>>> allowed to steal electrons, but that may be better than what's happening
>>> here.
>>>
>>> Apologies for the dissent!
>>>
>>> Chris Earnshaw
>>>
>>>
>>>
>>> On Tue, 23 Oct 2018 at 11:57, Greg Landrum 
>>> wrote:
>>>
>>>> The current implementation requires "exocyclic" bonds to actually be
>>>> *non-ring* bonds in order to be recognized as such.
>>>> This is perhaps sub-optimal, but it's clearly defined and avoids
>>>> arguments about when exactly an "exocyclic" bond starts stealing electrons.
>>>>
>>>> -greg
>>>>
>>>> On Tue, Oct 23, 2018 at 12:46 PM Francis Atkinson 
>>>> wrote:
>>>>
>>>>> Ian,
>>>>>
>>>>> I make it 6 electrons: two from the N, none from the C double
>>>>> bonded to the exocyclic N, and one each from four other carbons in the
>>>>> ring. It's isoelectronic with *e.g.* pyridone, which is aromatic in
>>>>> RDKit...
>>>>>
>>>>> In [1]: from rdkit import Chem
>>>>>
>>>>> In [2]: Chem.MolToSmiles(Chem.MolFromSmiles('O=c1[nH]1'))
>>>>> Out[2]: 'O=c1c

Re: [Rdkit-discuss] Aromaticity question

2018-10-23 Thread Chris Earnshaw
Hi
I think my approach to this is - Is there a resonance form in which the
ring in question in unequivocally aromatic and the separated charge ends up
somewhere sensible? The 'electron stealing' concept  is a sort of handy
shortcut for this.

For Greg's examples, I'd say:
[image: image.png]
I'm not sure I understand the
'If a "more electronegative atom" is valence saturated and has a double
bond to a C, then it contributes two electrons to whatever ring system it's
in'
statement. The key thing is probably that the 'more electronegative atom'
has to be outside the conjugated, potentially aromatic ring-system.
Following this analysis means you don't need to consider the resonance
form:
A carbonyl or imine (open chain or in a partially saturated ring) group's
carbon atom provides (effectively) 0 pi electrons
'Ordinary' aromatic carbons provide 1 pi-electron
Pyridine-type nitrogens (2-connections) or pyridinium (3-connections)
provide 1 pi-electron
Pyrrole-type nitrogens (3-connections) provide 2 pi-electrons.

So for example 3, that's
[image: image.png]
10 pi-electrons in total, so the *system* is aromatic even though the
electron counts look odd on a per-ring basis (not unlike azulene).

I hope that makes sense!

Chris

On Tue, 23 Oct 2018 at 16:17, Greg Landrum  wrote:

>
> On Tue, Oct 23, 2018 at 5:14 PM Greg Landrum 
> wrote:
>
>>
>> That certainly handles the things we've discussed so far, as well as easy
>> cases like pyridine and quinone. Now I need to try and find some stuff that
>> breaks it.
>>
>>
> What I should have added to this: I don't think it can possibly be this
> simple, so I'm guessing those examples are out there.
>
> -greg
>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Aromaticity question

2018-10-23 Thread Chris Earnshaw
Interesting - I do hope your idea works out!

This prompted me to see what happens with azulene, which is another case
where the envelope is aromatic but neither of the individual rings are
based on a simple neutral representation. This ends up being related to
Peter's example; the input SMILES c1c2c1ccc2 gets converted to
c1cc2cc-2c1, where the fusion bond is perceived as 'pure single' rather
than aromatic as it should be, so we do indeed end up with two non-aromatic
rings embedded in an aromatic envelope.

Is there any way to define the aromaticity of an individual ring based on
its mapping to an aromatic envelope? Are there any cases where that
wouldn't be true?

Chris

On Tue, 23 Oct 2018 at 17:35, Greg Landrum  wrote:

>
> I'll try later (likely tomorrow) to explain what I meant a bit better. Or
> maybe I'll just implement it (since it seems like it could be fairly easy).
>
> On Tue, Oct 23, 2018 at 6:13 PM Chris Earnshaw 
> wrote:
>
>>
>> Following this analysis means you don't need to consider the resonance
>> form:
>> A carbonyl or imine (open chain or in a partially saturated ring) group's
>> carbon atom provides (effectively) 0 pi electrons
>> 'Ordinary' aromatic carbons provide 1 pi-electron
>> Pyridine-type nitrogens (2-connections) or pyridinium (3-connections)
>> provide 1 pi-electron
>> Pyrrole-type nitrogens (3-connections) provide 2 pi-electrons.
>>
>> So for example 3, that's
>> [image: image.png]
>> 10 pi-electrons in total, so the *system* is aromatic even though the
>> electron counts look odd on a per-ring basis (not unlike azulene).
>>
>
> On this specific instance: This is exactly what the RDKit currently does.
> However, in this scheme the left ring, which has 7 pi electrons, is not
> aromatic even though both the right ring and the envelope are.
>
> -greg
>
>
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Smarts notation for aromatic regioisomers

2020-01-29 Thread Chris Earnshaw
Hi

Dot-disconnected fragments are not going to work for this, as you describe.
You need to use recursive SMARTS (see
https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html section
4.4). Something like:
Clc[$(cBr);$(ccBr);$(cccBr)]
should (I hope!) be a reasonable starting point.

Chris Earnshaw

On Wed, 29 Jan 2020 at 11:12, Alexis Parenty 
wrote:

> Hi everyone,
>
>
>
> Is there a way to get a substructure match of regioisomers using a smarts
> by separating the fragments with “.”:
>
>
>
> The following approach works but is too permissive since it will also
> match structures with a bromide or a chloride linked to a aliphatic
> carbon...
>
>
> [image: image.png]
>
>
>
> mol_smiles_structure = Chem.MolFromSmiles("Clc1cc(Br)ccc1")
>
> mol_smarts_fragment = Chem.MolFromSmarts("c1c1.[Cl].[Br]")
>
> print(mol_smiles_structure.HasSubstructMatch(mol_smarts_fragment))
>
> ð  True
>
>
>
>
>
> mol_smiles_structure = Chem.MolFromSmiles("ClCc1cc(Br)ccc1")
>
> print(mol_smiles_structure.HasSubstructMatch(mol_smarts_fragment))
>
> ð  True
>
>
>
> If I specify that the Br and the Cl need to be attached to an aromatic to
> make it less permissive, it no longer match because I cannot specify that
> the two general aromatics [a] also belong to the benzene ring... (i,e. it
> tries to look for 8 aromatics instead of six...)
>
> mol_smiles_structure = Chem.MolFromSmiles("Clc1cc(Br)ccc1")
>
> mol_smarts_fragment = Chem.MolFromSmarts("c1c1.[a][Cl].[a][Br]")
>
> print(mol_smiles_structure.HasSubstructMatch(mol_smarts_fragment))
>
> ð  False
>
>
>
> Thanks!
>
> Alexis
>
>
>
>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] SMILES/SMARTS codes that match multiple atoms

2020-02-09 Thread Chris Earnshaw
Hi

I've always regarded it as dangerous to rely on the use of explicit
hydrogens in search queries and pattern matches. I think it's generally
safer to use H-count properties in your SMARTS. In your example case this
will require the use of recursive SMARTS to allow matching of the CH3 and
CH2Cn fragments you're interested in. The SMARTS
"[CH](=O)O[$(CH3);$([CH2]C)]" should do what you want. The [CH] forces it
to only match formate esters. The recursive SMARTS [$(CH3);$([CH2]C)] can
be interpreted as 'an atom which is EITHER an aliphatic carbon with 3
hydrogens OR an aliphatic carbon with two hydrogens and an attached
aliphatic carbon'. It's possible to build very powerful queries using this
kind of approach, and it's not necessary to add explicit Hs to make it work.

from rdkit import Chem
mols = [Chem.MolFromSmiles(s) for s in ["C(=O)OC", "C(=O)OCC", "C(=O)OCCC"]]
pat1 = Chem.MolFromSmarts("[CH](=O)O[$(CH3);$([CH2]C)]")
[mol.HasSubstructMatch(mol, pat1) for mol in mols]
[True, True, True]

All the best,
Chris

On Sat, 8 Feb 2020 at 20:29, Andrew Dalke  wrote:

> On Feb 8, 2020, at 17:55, Janusz Petkowski  wrote:
> >
> > If not how can I match cases where in a given position there can be C or
> H with rdkit?
>
> I believe you should use #1 instead of H.
>
>
> >>> from rdkit import Chem
> >>> mols = [Chem.MolFromSmiles(s) for s in ["C(=O)OC", "C(=O)OCC",
> "C(=O)OCCC"]]
> >>> hmols = [Chem.AddHs(mol) for mol in mols]
>
>
>   Your pattern:
>
> >>> pat1 = Chem.MolFromSmarts("[H]C(=O)OC([C,H])([H])[H]")
> >>> [mol.HasSubstructMatch(pat1) for mol in hmols]
> [False, True, True]
>
>   Using #1 instead of H:
>
> >>> pat2 = Chem.MolFromSmarts("[H]C(=O)OC([C,#1])([#1])[#1]")
> >>> [mol.HasSubstructMatch(pat2) for mol in hmols]
> [True, True, True]
>
>
> "H" has an odd interpretation.
> https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html says:
>
> Note that atomic primitive H can have two meanings,
> implying a property or the element itself. [H] means
> hydrogen atom. [*H2] means any atom with exactly
> two hydrogens attached
>
> I believe the goal of having [H] match a hydrogen atom is to allow a
> SMILES, when interpreted as a SMARTS, to be able to match the SMILES when
> interpreted as a molecule. I'm not sure about that though.
>
> Cheers,
>
> Andrew
> da...@dalkescientific.com
>
>
>
>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] SMILES/SMARTS codes that match multiple atoms

2020-02-09 Thread Chris Earnshaw
Sorry - tried to type this too early in the morning and introduced some
errors transcribing the SMARTS pattern!

It should have been "[CH](=O)O[$([CH3]),$([CH2]C)]") as in
pat1 = Chem.MolFromSmarts("[CH](=O)O[$([CH3]),$([CH2]C)]")

Best regards,
Chris

On Sun, 9 Feb 2020 at 08:28, Chris Earnshaw  wrote:

> Hi
>
> I've always regarded it as dangerous to rely on the use of explicit
> hydrogens in search queries and pattern matches. I think it's generally
> safer to use H-count properties in your SMARTS. In your example case this
> will require the use of recursive SMARTS to allow matching of the CH3 and
> CH2Cn fragments you're interested in. The SMARTS
> "[CH](=O)O[$(CH3);$([CH2]C)]" should do what you want. The [CH] forces it
> to only match formate esters. The recursive SMARTS [$(CH3);$([CH2]C)] can
> be interpreted as 'an atom which is EITHER an aliphatic carbon with 3
> hydrogens OR an aliphatic carbon with two hydrogens and an attached
> aliphatic carbon'. It's possible to build very powerful queries using this
> kind of approach, and it's not necessary to add explicit Hs to make it work.
>
> from rdkit import Chem
> mols = [Chem.MolFromSmiles(s) for s in ["C(=O)OC", "C(=O)OCC",
> "C(=O)OCCC"]]
> pat1 = Chem.MolFromSmarts("[CH](=O)O[$(CH3);$([CH2]C)]")
> [mol.HasSubstructMatch(mol, pat1) for mol in mols]
> [True, True, True]
>
> All the best,
> Chris
>
> On Sat, 8 Feb 2020 at 20:29, Andrew Dalke 
> wrote:
>
>> On Feb 8, 2020, at 17:55, Janusz Petkowski  wrote:
>> >
>> > If not how can I match cases where in a given position there can be C
>> or H with rdkit?
>>
>> I believe you should use #1 instead of H.
>>
>>
>> >>> from rdkit import Chem
>> >>> mols = [Chem.MolFromSmiles(s) for s in ["C(=O)OC", "C(=O)OCC",
>> "C(=O)OCCC"]]
>> >>> hmols = [Chem.AddHs(mol) for mol in mols]
>>
>>
>>   Your pattern:
>>
>> >>> pat1 = Chem.MolFromSmarts("[H]C(=O)OC([C,H])([H])[H]")
>> >>> [mol.HasSubstructMatch(pat1) for mol in hmols]
>> [False, True, True]
>>
>>   Using #1 instead of H:
>>
>> >>> pat2 = Chem.MolFromSmarts("[H]C(=O)OC([C,#1])([#1])[#1]")
>> >>> [mol.HasSubstructMatch(pat2) for mol in hmols]
>> [True, True, True]
>>
>>
>> "H" has an odd interpretation.
>> https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html says:
>>
>> Note that atomic primitive H can have two meanings,
>> implying a property or the element itself. [H] means
>> hydrogen atom. [*H2] means any atom with exactly
>> two hydrogens attached
>>
>> I believe the goal of having [H] match a hydrogen atom is to allow a
>> SMILES, when interpreted as a SMARTS, to be able to match the SMILES when
>> interpreted as a molecule. I'm not sure about that though.
>>
>> Cheers,
>>
>> Andrew
>> da...@dalkescientific.com
>>
>>
>>
>>
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] AdditionalOutput from FingerprintGenerator

2020-03-17 Thread Chris Earnshaw
A quick comment on the cosine metric. Unlike Tanimoto it obeys the triangle
inequality, so in cases where it's used essentially as a distance metric
(e.g. some clustering applications) the results are probably more
mathematically correct. I used it a lot in that context. Whether it makes
any real difference in practical terms is of course questionable as the
fingerprints themselves are only very approximate descriptors.

All the best,
Chris

On Tue, 17 Mar 2020 at 07:28, Greg Landrum  wrote:

> Hi Jason,
>
> On Mon, Mar 16, 2020 at 1:26 PM Jason Biggs  wrote:
>
>> Thank you again Greg.  If you have time to get this in the upcoming
>> release great, do not rush on my account.
>>
>
>  I spent some time looking at this tomorrow and it's not going to be a
> quick one: it'll require some thought and refactoring to make the bit info
> work in all circumstances. That means that it won't make it into the
> upcoming release.
>
>
>> I have another couple of questions regarding fingerprints in general and
>> the fingerprint generators in particular.
>>
>>
>>- To what degree do people use the different fingerprint types? Is it
>>more common to use the RDKit fingerprint, for example, as a bit vector, 
>> and
>>the Morgan fingerprint as a counts vector?  Does it depend on the
>>application or is it more how a particular fingerprint was historically
>>used?
>>
>> It's hard to be really sure, but I would guess that the Morgan
> fingerprints are the most used. I'm also going to guess, and this is based
> on even more of a gut feeling, that people are using bit vectors, not count
> vectors. As for why... good question. Probably because the Morgan
> fingerprints tend to work well in general (though there definitely is no
> "best" fingerprint
> https://link.springer.com/article/10.1186/1758-2946-5-26,
> http://pubs.acs.org/doi/abs/10.1021/ci400466r), give results that look
> "chemically similar" and there's lots of sample code around.
>
>>
>>- I notice there is a wider variety of distance measures available
>>for bit vectors than for count vectors. Is this because these measures, 
>> the
>>McConnaughey similarity for example, aren't extendable to multisets in the
>>same way that Tversky similarity can? Or is it just that there hasn't been
>>any demand for non-bitvector versions of the measures in BitOps.h?
>>
>> Aeons ago when I wrote that code I wanted to be sure to have as many
> possible metrics as possible available. Since then it's become clear that
> Tanimoto/Dice (you can prove that these rank results exactly the same) and
> Tversky (because it allows you to do asymmetric measures) cover most every
> need. I've also seen people use cosine similarity for comparing molecules
> of different sizes (though asymmetric Tversky lets you do the same).
>
>>
>>- Would it be useful to people for the FingerprintGenerator class to
>>return the list of atom invariants (or environments) used?  Or is that 
>> what
>>the BitInfo is used for?
>>
>> there are generators available for the atom and bond invariants of each
> of the fingerprint types. The fingerprint generators don't have a method
> available that allows you to retrieve the atom/bond invariant generators
> that they are using, but we could add this if it would be useful.
>
> -greg
>
>
>
>
>> Best,
>> Jason
>>
>>
>>
>> On Fri, Mar 13, 2020 at 11:13 PM Greg Landrum 
>> wrote:
>>
>>> Unfortunately it looks like the additional outputs for morgan, and rdkit
>>> fingerprints are parts that weren't finished:
>>>
>>> https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Fingerprints/MorganGenerator.cpp#L143
>>>
>>> https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Fingerprints/RDKitFPGenerator.cpp#L99
>>>
>>> I will take a look and see if it's possible to get these into the next
>>> release. In the meantime, if you want that info it looks like you'll need
>>> to use the older fingerprinting functions.
>>>
>>> -greg
>>>
>>> On Fri, Mar 13, 2020 at 11:10 PM Jason Biggs 
>>> wrote:
>>>
 Thank you Greg.

 I am working in C++.  I can poke around with this if I knew which
 members of the AdditionalOutput struct are used by which fingerprint
 generators.  I just wanted to make sure there wasn't an explanation
 somewhere I missed.

 I can see that with the AtomPairs fingerprints I can do the following

 //mol is an *ROMol and fpg is a *FingerprintGenerator
 RDKit::AdditionalOutput ao;

 std::vector> atomtobits(mol->getNumAtoms());
 ao.atomToBits = &atb;

 auto res = fpg->getSparseCountFingerprint(*mo, nullptr, nullptr, -1,
 &ao);

 after which atomtobits contains a list of bits for each atom.  From the
 comments I think the bitInfo member should be used by the
 RDKitFingerprintGenerator, but I don't see where it is used in the code.
 Is that the part that wasn't finished?  Is it possible to get information
 about the atoms/environments