Re: [Rdkit-discuss] Observations about RDKit performance: PatternFingerprinter, Windows, Linux and Virtual machines

2020-01-22 Thread Thomas Strunz
Hi Greg,

Reopening this old question. I can see that there are potential differences 
between RDKit versions, and especially between Linux and Windows, but let's leave 
that aside for now.

After further "playing around", however, I really have the impression there is a 
real issue with running RDKit (or Python?) in a virtualized operating system. 
Since most production software, especially in the cloud, runs in a virtualized 
operating system, I think this is a fairly relevant topic worth investigating. 
As you showed yourself, the AWS system was also fairly slow.

For the following observations I'm keeping the same dataset as before, which is 
from your blog post ( /Regress/Scripts/fingerprint_screenout.py). Basically 
it's that code, slightly adapted:

import gzip

from rdkit import Chem, DataStructs

# data_dir must point to the directory holding the benchmark data files
mols = []
with gzip.open(data_dir + 'chembl21_25K.pairs.txt.gz', 'rb') as inf:
    for line in inf:
        line = line.decode().strip().split()
        smi1 = line[1]
        smi2 = line[3]
        m1 = Chem.MolFromSmiles(smi1)
        m2 = Chem.MolFromSmiles(smi2)
        mols.append(m1)
        mols.append(m2)

# fragment queries: the first column of each line is the SMILES
with open(data_dir + 'zinc.frags.500.q.smi', 'r') as fragf:
    frags = [Chem.MolFromSmiles(x.split()[0]) for x in fragf]

mfps = [Chem.PatternFingerprint(m, 512) for m in mols]
fragsfps = [Chem.PatternFingerprint(m, 512) for m in frags]

%%timeit -n1 -r1
for i, fragfp in enumerate(fragsfps):
    hits = 0
    for j, mfp in enumerate(mfps):
        # cheap fingerprint screen first, full substructure match only on pass
        if DataStructs.AllProbeBitsMatch(fragfp, mfp):
            if mols[j].HasSubstructMatch(frags[i]):
                hits = hits + 1


I want to focus on the last cell, namely the "AllProbeBitsMatch" method:

%%timeit
DataStructs.AllProbeBitsMatch(fragsfps[10], mfps[10])

Results:

Windows 10 native, i7-8850H:
    567 ns ± 5.48 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
Lubuntu 16.04 virtualized, i7-8850H:
    1.81 µs ± 56.7 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
    (the high variation is consistent)
Windows Server 2012 R2 virtualized, Xeon E5-2620 v4:
    1.18 µs ± 4.09 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

So virtualization seems to at least halve the performance of this specific 
method (the VMs take roughly two to three times as long per call). That matches 
what I see when running the full substructure search code, which takes double 
the time on the virtualized machines. (The Windows server actually runs on ESX, 
i.e. a type 1 hypervisor, while the Lubuntu VM runs on a type 2 hypervisor 
(VMware Workstation), but both seem to suffer the same.)
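
To repeat this measurement outside of Jupyter on each machine, something like the following could be used. It is only a sketch: `time_call` is a hypothetical helper built on the standard `timeit` module, and `all_probe_bits_match` is a pure-Python stand-in (on plain integers) for the subset test that a fingerprint screen performs, not the RDKit implementation.

```python
import statistics
import timeit


def time_call(func, repeat=7, number=100_000):
    """Return (mean, stdev) in seconds per call over `repeat` runs."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [t / number for t in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)


def all_probe_bits_match(probe, target):
    """Stand-in screen: True if every bit set in probe is also set in target."""
    return (probe & target) == probe


# Hypothetical 512-bit fingerprints stored as Python ints (not RDKit objects).
probe_fp = (1 << 3) | (1 << 200) | (1 << 511)
target_fp = probe_fp | (1 << 17) | (1 << 300)

mean_s, std_s = time_call(lambda: all_probe_bits_match(probe_fp, target_fp))
print(f"{mean_s * 1e9:.0f} ns ± {std_s * 1e9:.1f} ns per call")
```

To time the real call instead, pass `lambda: DataStructs.AllProbeBitsMatch(fragsfps[10], mfps[10])` to `time_call` after building the fingerprints as above. The lambda adds a small constant overhead, so only compare numbers produced the same way.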

We can try the same thing with

%%timeit
mols[10].HasSubstructMatch(frags[10])

The difference here is smaller but VMs also take >50% more time.

So there seems to be a consistent large performance impact in VMs.

Of course a VM will be a bit slower, but surely not by that much? What am I 
missing? Has anyone had other experiences?

Best Regards,

Thomas

From: Greg Landrum 
Sent: Monday, 16 December 2019 17:10
To: Thomas Strunz 
Cc: rdkit-discuss@lists.sourceforge.net 
Subject: Re: [Rdkit-discuss] Observations about RDKit performance: 
PatternFingerprinter, Windows, Linux and Virtual machines

Hi Thomas,

First it is important to compare equivalent major versions to each other. 
Particularly in this case. On my linux box generating the pattern fingerprints 
takes 24.2 seconds with v2019.03.x and 15.9 seconds with v2019.09.x (that's due 
to the improvements in the substructure matcher that the blog post you link to 
discusses).

Comparing the same versions to each other:

Performance on Windows vs Linux
Windows performance with the RDKit has always lagged behind Linux performance. 
There's something in the code (or in the way we use the compiler) that leads to 
big differences on some benchmarks. The most straightforward way I can 
demonstrate this is with results from my Windows 10 laptop.
Here's the output when running the fingerprint_screenout.py benchmark using the 
Windows build:
| 2019.09.1 | 13.6 | 0.3 | 38.1 | 0.8 | 25.5 | 25.9 | 84.1 |
and here's the output from a Linux build running on the Windows Subsystem for 
Linux:
| 2019.09.2 | 10.7 | 0.2 | 19.3 | 0.4 | 19.4 | 19.2 | 53.2 |
You can see the differences are not small.
I haven't invested massive time into it, but I haven't been able to figure out 
what causes this.

Performance on (Linux) VMs
I can't think of any particular reason why there should be huge differences, and 
it's really difficult to compare apples to apples here.
Since I have the numbers, here's one comparison.

Here's a run on my Linux workstation:
| 2019.09.2 | 7.6 | 0.3 | 15.9 | 0.4 | 21.4 | 20.4 | 55.7 |
and here's the same thing on an AWS t3.xlarge instance:
| 2019.09.2 | 9.6 | 0.2 | 20.3 | 0.4 | 38.4 | 38.2 | 94.7 |
The VM is significantly slower, but t3.xlarge isn't an instance type that's 
intended to be used for compute-intensive jobs (I don't have one of those active 
and configured at the moment).
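
To put numbers on "not small" and "significantly slower", the four benchmark rows above can be compared column by column. This is only a sketch: the message does not label the columns, so they are referred to by position, and the `ratios` helper is a hypothetical name.

```python
# Benchmark rows copied from this message, with the leading RDKit version dropped.
# The columns are not labelled in the thread, so they are used by position only.
windows_native = [13.6, 0.3, 38.1, 0.8, 25.5, 25.9, 84.1]  # 2019.09.1
linux_on_wsl = [10.7, 0.2, 19.3, 0.4, 19.4, 19.2, 53.2]    # 2019.09.2
workstation = [7.6, 0.3, 15.9, 0.4, 21.4, 20.4, 55.7]      # 2019.09.2
aws_t3_xlarge = [9.6, 0.2, 20.3, 0.4, 38.4, 38.2, 94.7]    # 2019.09.2


def ratios(slower, faster):
    """Per-column ratio slower/faster, rounded to two decimals."""
    return [round(s / f, 2) for s, f in zip(slower, faster)]


win_vs_wsl = ratios(windows_native, linux_on_wsl)
aws_vs_workstation = ratios(aws_t3_xlarge, workstation)

print("Windows / WSL:    ", win_vs_wsl)
print("AWS / workstation:", aws_vs_workstation)
```

Keep in mind that the Windows row comes from 2019.09.1 while the others come from 2019.09.2, and that the very small columns (0.2 to 0.8) are too coarse for their ratios to mean much.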

Does that help at all?
-greg


On Mon, Dec 16, 2019 at 8:27 AM Thomas Strunz 

[Rdkit-discuss] Highlighting some parts of a structure

2020-01-22 Thread Alexis Parenty
Hi everyone,

I use SimilarityMaps.GetSimilarityMapFromWeights(mol, atom_ids) to
highlight some parts of a structure, but is it also possible to change the
thickness of some bonds of a structure knowing their atom IDs? If selected
bonds cannot be bold, can we change their color?

Many thanks and regards,

Alexis
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] acepentalene aromaticity perception

2020-01-22 Thread Peter S. Shenkin
Hi,

I still believe that Acepentalene should not be recognized by RDKit as
aromatic, because there is no ring that contains 4n+2 electrons. The fact
that counting bonds not in the outer ring gives 10 electrons should not
make the outer ring aromatic. Moreover, RDKit seems to perceive aromaticity
correctly (using this criterion) in several similar systems which have 4n+2
electrons in the outer ring but in which counting additional electrons in
bonds not in the ring would, following Greg's interpretation, make them
non-aromatic.

Here are several examples, starting with a recap of Acepentalene.

For Acepentalene itself, as Greg pointed out, there are 9 electrons in the
outer ring, but since no single ring contains 4n+2 electrons, I do not
believe it should be considered aromatic.

[image: acepentalene.png]

---
Now consider the closely related compound created by making one of those
5-membered rings a 7-membered ring. It's called aceazulylene (yes, I had to
look this up!). The outer ring is aromatic in my view, because it has 10
(i.e., 4n+2) electrons. The ring system has 12 electrons, and so it would
seem that, based on Greg's discussion of acepentalene, it should be
perceived as non-aromatic. Yet RDKit perceives it as aromatic, correctly
in my view. As a side issue, I would have thought that, as in azulene, the
internal bonds and the central carbon would not have been perceived as
aromatic by RDKit; this is the same issue that Andrew originally raised for
Acepentalene.

[image: aceazulylene.png]

---
Now consider dicyclopenta[cd,gh]pentalene from Schleyer's paper (referenced
in Andrew Dalke's recent email). Again, this molecule has 12 electrons in
total, so that again, based on Greg's discussion of acepentalene, I'd have
thought RDKit would consider it non-aromatic. But the outer ring consists
of a pi system containing 4n+2 electrons, and so, in my view, it should be
considered aromatic. Schleyer's calculations agree. And again, as in
aceazulylene, RDKit in fact correctly perceives it as aromatic, although,
as in aceazulylene, the internal bonds and carbons should probably not be
perceived as aromatic.

[image: dicyclopenta[cd,gh]pentalene.png]

As a closing comment, it seems to me that if ring bonds are counted and
off-ring bonds are ignored, electron counting would correctly infer the
aromaticity or not of these compounds. MO calculations, as per Schleyer,
would not be required for this purpose – at least for these compounds!
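
The electron counting argued for here can be written down in a few lines. This is a minimal sketch of plain 4n+2 / 4n counting, not RDKit's aromaticity model; `huckel_class` is a hypothetical helper and the electron counts are the ones quoted in this thread, not computed from structures.

```python
def huckel_class(pi_electrons):
    """Classify a pi-electron count by simple Hückel counting.

    4n+2 gives "aromatic", 4n (n >= 1) gives "antiaromatic",
    any other count gives "neither".
    """
    if pi_electrons >= 2 and pi_electrons % 4 == 2:
        return "aromatic"
    if pi_electrons >= 4 and pi_electrons % 4 == 0:
        return "antiaromatic"
    return "neither"


# Electron counts as stated in the thread (assumed, not derived here).
counts = {
    "acepentalene, outer ring": 9,
    "acepentalene, whole system": 10,
    "acepentalene, single 5-membered ring": 5,
    "aceazulylene, outer ring": 10,
    "aceazulylene, whole system": 12,
    "dicyclopenta[cd,gh]pentalene, whole system": 12,
}

for name, n in counts.items():
    print(f"{name}: {n} electrons -> {huckel_class(n)}")
```

Counting along the periphery and counting over the whole ring system give different answers for all three compounds, which is exactly the disagreement in this thread.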

-P.


[Rdkit-discuss] last call for mmpdb funding

2020-01-22 Thread Andrew Dalke
Hi all,

  This is the last email I'll send asking for people and organizations to join 
the current mmpdb crowdsourcing effort.

I've discussed it several times before here. In summary, I'm looking for 
crowdfunding for the matched molecular pair program 'mmpdb'. This is part of a 
test to find alternative ways to raise money for open source development in 
cheminformatics. See http://mmpdb.dalkescientific.com/ for details.

I'm pleased to say the effort has so far raised EUR 17 500. This is enough 
to fully finish off Postgres support and the new 'proprulecat', and pay back 
for the time taken to organize this funding effort.

In addition, it passed the EUR 16 000 goal, which means that next month I'll 
change mmpdb so it stores the environment as a more easily interpretable 
fragment SMILES, rather than a hashed version of the circular environments.

The next funding goal is EUR 23 000, which is EUR 5 500 away. If I reach that 
funding goal, I will commit to making a public/no-cost release of the new code 
in Oct. 2020.
 
I've extended the deadline to join to 15 February because some additional 
marketing will go out at the end of this month, and I want to give recipients a 
chance to participate.

People can still purchase mmpdb after the deadline is reached. Those goals 
merely represent specific commitments from me to work on mmpdb.

For those interested in the budget, or who think that EUR 17 500 is already a 
large amount of money:

EUR 17 500 is corporate income, not salary income. About 50% of that goes to 
payroll taxes. The average software developer salary here in Sweden is about 
EUR 50 000, so EUR 9 000 is about 2 months of time. (I would be making above 
average salary should I decide to go corporate.) I spent several weeks working 
on the web site - which is essential marketing for crowdsourcing - and paid a 
web designer to help. I've also spent a few weeks working on improvements 
already delivered to customers. Even invoicing takes time.

But the goal of this effort isn't for me to be rich, though that would be 
nice. It's to see if an open-source project like mmpdb can be economically 
self-sustaining through funding by users interested in paying for specific new 
features and support.

Best regards,

Andrew
da...@dalkescientific.com






Re: [Rdkit-discuss] acepentalene aromaticity perception

2020-01-22 Thread Andrew Dalke
On Jan 22, 2020, at 14:12, Greg Landrum  wrote:
> As an aside: it's not particularly relevant to this discussion, but I don't 
> understand why the wikipedia page says that the compound is anti-aromatic. I 
> think the standard definition of anti-aromaticity (agrees with the one linked 
> to from the acepentalene page) requires the ring system to have 4n electrons. 
> That definitely doesn't apply here to either the individual rings or the 
> system as a whole. The system as a whole has 10 electrons (4n+2), the 
> individual rings each have 5 (neither aromatic nor anti-aromatic), and the 
> outer envelope has 9 (again, neither aromatic nor anti-aromatic).

Because I didn't know either, I looked into it.

I think that's because (to quote "Towards experimental determination of conical 
intersection properties: a twin state based comparison with bound excited 
states", Phys. Chem. Chem. Phys., 2011, 13, 11872–11877 [*] )

> A Hückel MO analysis[21] leads to the conclusion that the ground state of the 
> conjugated tricyclic acepentalene I is a triplet state. DFT calculations 
> corrected this picture and showed a singlet global minimum distorted to C_s 
> symmetry with alternated single and double bonds,[22] which are well 
> described by the Lewis structures A(B,C). According to a B3LYP/6-31G* 
> calculation the lowest triplet state has also a high symmetric C_3v 
> configuration and lies 3.9 kcal/mol above the singlet ground state minimum. 
> Acepentalene I was characterized as an antiaromatic system [23] despite being 
> formally an aromatic 10 electron system: the resonance between each pair of 
> Kekule structures in this case involves only 4 electron pairs of the 
> pentalene fragments and it averts the resonance with the additional fifth 
> electron pair common for both the structures. Such a resonance is described 
> as an anti-combination of two Kekule structures: (A–B), (C–B) and (C–A).

Just need to add B3LYP/6-31G* calculations to RDKit's aromaticity perception 
algorithm and everything will be fine. :)

The "characterized as an antiaromatic system[23]" is "T. K. Zywietz, H. Jiao, 
P. v. R. Schleyer and A. de Meijere, J. Org. Chem., 1998, 63, 3417" at 
https://pubs.acs.org/doi/abs/10.1021/jo980089f .


Cheers,

Andrew
da...@dalkescientific.com

[*] 
https://www.researchgate.net/profile/Shmuel_Zilberg/publication/51175586_Towards_experimental_determination_of_conical_intersection_properties_A_twin_state_based_comparison_with_bound_excited_states/links/561bb5bc08ae6d17308b037f/Towards-experimental-determination-of-conical-intersection-properties-A-twin-state-based-comparison-with-bound-excited-states.pdf





Re: [Rdkit-discuss] acepentalene aromaticity perception

2020-01-22 Thread Peter S. Shenkin
Hi,

For aromaticity, I believe a ring has to have 4n+2 electrons along its
periphery.

I would be curious to know what other SMILES generators make of this
system.

-P.

-- 
-P.
Sent from a cell phone. Pls forgive brvty and m1$tea@ks.


Re: [Rdkit-discuss] acepentalene aromaticity perception

2020-01-22 Thread Greg Landrum
Hi Andrew,

There's a bug here.

Here's what I believe is happening:
The system as a whole has 10 pi electrons, so the RDKit perceives it as
aromatic. But then the logic that is used to flag the fusing bond in
azulene as single (instead of aromatic) prevents the bonds between the
central atom and the outer ones from being flagged as aromatic. This is
clearly wrong. Now we just need to figure out how to fix it. :-)
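
To make the symptom concrete, here is a small check of the invariant from the original report (every aromatic atom has at least two aromatic bonds). It is a sketch only: the atoms and bonds are transcribed by hand from the canonical SMILES 'c1cc2ccc3ccc1-c=3-2' quoted in this thread rather than read from an RDKit molecule, and `atoms_violating_invariant` is a hypothetical helper.

```python
# Hand-transcribed from the SMILES 'c1cc2ccc3ccc1-c=3-2' quoted in this thread.
# Atoms are numbered 0..9 in SMILES order; all ten are written as aromatic 'c'.
aromatic_atoms = set(range(10))

# (atom_a, atom_b, bond_type); an implicit bond between two 'c' atoms is aromatic.
bonds = [
    (0, 1, "aromatic"), (1, 2, "aromatic"), (2, 3, "aromatic"),
    (3, 4, "aromatic"), (4, 5, "aromatic"), (5, 6, "aromatic"),
    (6, 7, "aromatic"), (7, 8, "aromatic"),
    (8, 0, "aromatic"),  # ring closure 1
    (8, 9, "single"),    # '-c'
    (9, 5, "double"),    # '=3'
    (9, 2, "single"),    # '-2'
]


def atoms_violating_invariant(aromatic_atoms, bonds):
    """Return aromatic atoms that have fewer than two aromatic bonds."""
    counts = {atom: 0 for atom in aromatic_atoms}
    for a, b, kind in bonds:
        if kind == "aromatic":
            for atom in (a, b):
                if atom in counts:
                    counts[atom] += 1
    return sorted(atom for atom, n in counts.items() if n < 2)


violations = atoms_violating_invariant(aromatic_atoms, bonds)
print(violations)
```

Only the central atom (index 9) is reported: it is flagged aromatic but has no aromatic bonds at all. On a real molecule the same loop could be driven by `atom.GetIsAromatic()` and `bond.GetIsAromatic()`, and the result will depend on the RDKit version.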

As an aside: it's not particularly relevant to this discussion, but I don't
understand why the wikipedia page says that the compound is anti-aromatic.
I think the standard definition of anti-aromaticity (agrees with the one
linked to from the acepentalene page) requires the ring system to have 4n
electrons. That definitely doesn't apply here to either the individual
rings or the system as a whole. The system as a whole has 10 electrons
(4n+2), the individual rings each have 5 (neither aromatic nor
anti-aromatic), and the outer envelope has 9 (again, neither aromatic nor
anti-aromatic).

Sorry for the super slow reply.
-greg


On Thu, Jan 9, 2020 at 9:56 PM Andrew Dalke 
wrote:

> Hi all,
>
> Could someone explain the following, which uses the SMILES from
> https://en.wikipedia.org/wiki/Acepentalene :
>
> >>> from rdkit import Chem
> >>> Chem.CanonSmiles("C1=CC2=CC=C3C2=C1C=C3")
> 'c1cc2ccc3ccc1-c=3-2'
> >>> import rdkit
> >>> rdkit.__version__
> '2019.09.1'
>
> I don't understand the aromatic "c" in the fused center of the 3
> 5-membered rings. It's connected by non-aromatic bonds to the rest of the
> system.
>
> This broke some code of mine which expects that every aromatic atom must
> have at least two aromatic bonds. I thought that all aromatic atoms had to
> be in aromatic rings, and that all aromatic rings had to have aromatic bonds.
>
> (I'm ignoring RDKit's support for aromatic triple bonds in this
> description.)
>
> I searched for "acepentalene" and "antiaromatic" in the issue tracker and
> the mailing list but found nothing relevant.
>
> Cheers,
>
> Andrew
> da...@dalkescientific.com