Re: [Rdkit-discuss] Observations about RDKit performance: PatternFingerprinter, Windows, Linux and Virtual machines
Hi Greg,

Reopening this old question.

I can see that there are potential differences between RDKit versions, and especially between Linux and Windows, but let's leave that aside for now.

After further "playing around", however, I really have the impression that there is a real issue with running RDKit (or Python?) in a virtualized operating system. Since most production software, and anything running in the cloud, will mostly run in a virtualized operating system, I think this is a fairly relevant topic worth investigating. As you showed yourself, the AWS system was also fairly slow.

For the following observations I'm keeping the same datasets as before, which are from your blog post ( /Regress/Scripts/fingerprint_screenout.py). Basically it's that code, slightly adapted:

    import gzip
    from rdkit import Chem, DataStructs

    # data_dir is the directory holding the benchmark files (defined elsewhere)
    mols = []
    with gzip.open(data_dir + 'chembl21_25K.pairs.txt.gz', 'rb') as inf:
        for line in inf:
            line = line.decode().strip().split()
            smi1 = line[1]
            smi2 = line[3]
            m1 = Chem.MolFromSmiles(smi1)
            m2 = Chem.MolFromSmiles(smi2)
            mols.append(m1)
            mols.append(m2)

    # query fragments: the first column of each line is the SMILES
    frags = [Chem.MolFromSmiles(x.split()[0]) for x in open(data_dir + 'zinc.frags.500.q.smi', 'r')]

    # 512-bit pattern fingerprints for the molecules and for the fragments
    mfps = [Chem.PatternFingerprint(m, 512) for m in mols]
    fragsfps = [Chem.PatternFingerprint(m, 512) for m in frags]

    %%timeit -n1 -r1
    for i, fragfp in enumerate(fragsfps):
        hits = 0
        for j, mfp in enumerate(mfps):
            # fingerprint screen first, full substructure match only for survivors
            if DataStructs.AllProbeBitsMatch(fragfp, mfp):
                if mols[j].HasSubstructMatch(frags[i]):
                    hits = hits + 1

I want to focus on the last cell, and namely on the "AllProbeBitsMatch" method:

    %%timeit
    DataStructs.AllProbeBitsMatch(fragsfps[10], mfps[10])

Results:

Windows 10 native, i7-8850H:
567 ns ± 5.48 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

Lubuntu 16.04 virtualized, i7-8850H:
1.81 µs ± 56.7 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) // the high variation is consistent

Windows Server 2012 R2 virtualized, Xeon E5-2620 v4:
1.18 µs ± 4.09 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

So virtualization seems to at least halve the performance of this specific method (1.18 µs and 1.81 µs versus 567 ns), which is also what I see when running the full substructure search code: it takes double the time on the virtualized machines. (The Windows server actually runs on ESX, i.e. a type 1 hypervisor, while the Lubuntu VM runs on a type 2 hypervisor (VMware Workstation), but both seem to suffer the same.)

We can try the same thing with

    %%timeit
    mols[10].HasSubstructMatch(frags[10])

The difference here is smaller, but the VMs still take >50% more time.

So there seems to be a consistently large performance impact in VMs. Of course a VM will be a bit slower, but surely not by that much? What am I missing? Other experiences?

Best Regards,

Thomas

From: Greg Landrum
Sent: Monday, 16 December 2019 17:10
To: Thomas Strunz
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] Observations about RDKit performance: PatternFingerprinter, Windows, Linux and Virtual machines

Hi Thomas,

First, it is important to compare equivalent major versions to each other, particularly in this case. On my Linux box generating the pattern fingerprints takes 24.2 seconds with v2019.03.x and 15.9 seconds with v2019.09.x (that's due to the improvements in the substructure matcher that the blog post you link to discusses).

Comparing the same versions to each other:

Performance on Windows vs Linux

Windows performance with the RDKit has always lagged behind Linux performance. There's something in the code (or in the way we use the compiler) that leads to big differences on some benchmarks. The most straightforward way I can demonstrate this is with results from my Windows 10 laptop.
Here's the output when running the fingerprint_screenout.py benchmark using the Windows build:

| 2019.09.1 | 13.6 | 0.3 | 38.1 | 0.8 | 25.5 | 25.9 | 84.1 |

and here's the output from a Linux build running on the Windows Subsystem for Linux:

| 2019.09.2 | 10.7 | 0.2 | 19.3 | 0.4 | 19.4 | 19.2 | 53.2 |

You can see the differences are not small. I haven't invested massive time into it, but I haven't been able to figure out what causes this.

Performance on (Linux) VMs

I can't think of any particular reason why there should be huge differences, and it's really difficult to compare apples to apples here. Since I have the numbers, here's one comparison.

Here's a run on my Linux workstation:

| 2019.09.2 | 7.6 | 0.3 | 15.9 | 0.4 | 21.4 | 20.4 | 55.7 |

and here's the same thing on an AWS t3.xlarge instance:

| 2019.09.2 | 9.6 | 0.2 | 20.3 | 0.4 | 38.4 | 38.2 | 94.7 |

The VM is significantly slower, but t3.xlarge isn't an instance type that's intended to be used for compute-intensive jobs (I don't have one of those active and configured at the moment).

Does that help at all?

-greg

On Mon, Dec 16, 2019 at 8:27 AM Thomas Strunz
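
For readers who want to reproduce the per-call timings outside a notebook, here is a minimal sketch using only the Python standard library. It is not the RDKit implementation: all_probe_bits_match is a hypothetical pure-Python stand-in that applies the same screening idea (every bit set in the probe must also be set in the reference) to fingerprints stored as plain integers, and time_per_call is an assumed helper that mimics the "7 runs, 100 loops each" reporting of %%timeit.

```python
import statistics
import timeit


def all_probe_bits_match(probe: int, ref: int) -> bool:
    """Stand-in for the screening test: every bit set in probe is also set in ref."""
    return (probe & ref) == probe


def time_per_call(func, *args, repeats=7, loops=100):
    """Return (mean, std. dev.) in seconds per call over `repeats` runs of `loops` calls."""
    runs = timeit.repeat(lambda: func(*args), repeat=repeats, number=loops)
    per_call = [t / loops for t in runs]
    return statistics.mean(per_call), statistics.stdev(per_call)


# Hypothetical 512-bit fingerprints stored as Python ints (not RDKit bit vectors).
frag_fp = (1 << 3) | (1 << 100) | (1 << 511)
mol_fp = frag_fp | (1 << 7) | (1 << 256)

mean_s, std_s = time_per_call(all_probe_bits_match, frag_fp, mol_fp)
print(f"{mean_s * 1e9:.0f} ns ± {std_s * 1e9:.0f} ns per call "
      f"(mean ± std. dev. of 7 runs, 100 loops each)")
```

Running the same script natively and inside a VM on the same hardware gives a comparison that does not depend on RDKit at all, which may help separate an interpreter-level slowdown from an RDKit-specific one.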
[Rdkit-discuss] Highlighting some parts of a structure
Hi everyone,

I use SimilarityMaps.GetSimilarityMapFromWeights(mol, atom_ids) to highlight some parts of a structure, but is it also possible to change the thickness of some bonds of a structure, knowing their atom IDs? If the selected bonds cannot be drawn in bold, can we change their color?

Many thanks and regards,

Alexis

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] acepentalene aromaticity perception
Hi,

I still believe that acepentalene should not be recognized by RDKit as aromatic, because there is no ring that contains 4n+2 electrons. The fact that counting bonds not in the outer ring gives 10 electrons should not make the outer ring aromatic. Moreover, RDKit seems to perceive aromaticity correctly (using this criterion) in several similar systems which have 4n+2 electrons in the outer ring but in which counting additional electrons in bonds not in the ring would, following Greg's interpretation, make them non-aromatic.

Here are several examples, starting with a recap of acepentalene.

For acepentalene itself, as Greg pointed out, there are 9 electrons in the outer ring, and since no single ring contains 4n+2 electrons, I do not believe it should be considered aromatic.

[image: acepentalene.png]

---

Now consider the closely related compound created by making one of those 5-membered rings a 7-membered ring. It's called aceazulylene (yes, I had to look this up!). The outer ring is aromatic in my view, because it has 10 (i.e., 4n+2) electrons. The ring system has 12 electrons, and so it would seem that, based on Greg's discussion of acepentalene, it should be perceived as non-aromatic. Yet RDKit perceives it as aromatic, correctly in my view.

As a side issue, I would have thought that, as in azulene, the internal bonds and the central carbon would not have been perceived as aromatic by RDKit; this is the same issue that Andrew originally raised for acepentalene.

[image: aceazulylene.png]

---

Now consider dicyclopenta[cd,gh]pentalene from Schleyer's paper (referenced in Andrew Dalke's recent email). Again, this molecule has 12 electrons in total, so that again, based on Greg's discussion of acepentalene, I'd have thought RDKit would consider it non-aromatic. But the outer ring consists of a pi system containing 4n+2 electrons, and so, in my view, it should be considered aromatic. Schleyer's calculations agree.
And again, as in aceazulylene, RDKit in fact correctly perceives it as aromatic, although, as in aceazulylene, the internal bonds and carbons should probably not be perceived as aromatic.

[image: dicyclopenta[cd,gh]pentalene.png]

As a closing comment, it seems to me that if ring bonds are counted and off-ring bonds are ignored, electron counting would correctly infer the aromaticity or not of these compounds. MO calculations, as per Schleyer, would not be required for this purpose, at least for these compounds!

-P.

On Wed, Jan 22, 2020 at 8:50 AM Andrew Dalke wrote:

> On Jan 22, 2020, at 14:12, Greg Landrum wrote:
> > As an aside: it's not particularly relevant to this discussion, but I don't understand why the wikipedia page says that the compound is anti-aromatic. I think the standard definition of anti-aromaticity (agrees with the one linked to from the acepentalene page) requires the ring system to have 4n electrons. That definitely doesn't apply here to either the individual rings or the system as a whole. The system as a whole has 10 electrons (4n+2), the individual rings each have 5 (neither aromatic nor anti-aromatic), and the outer envelope has 9 (again, neither aromatic nor anti-aromatic).
>
> Because I didn't know either, I looked into it.
>
> I think that's because (to quote "Towards experimental determination of conical intersection properties: a twin state based comparison with bound excited states", Phys. Chem. Chem. Phys., 2011, 13, 11872–11877 [*])
>
> > A Hückel MO analysis[21] leads to the conclusion that the ground state of the conjugated tricyclic acepentalene I is a triplet state. DFT calculations corrected this picture and showed a singlet global minimum distorted to C_s symmetry with alternated single and double bonds,[22] which are well described by the Lewis structures A(B,C). According to a B3LYP/6-31G* calculation the lowest triplet state has also a high symmetric C_3v configuration and lies 3.9 kcal/mol above the singlet ground state minimum. Acepentalene I was characterized as an antiaromatic system [23] despite being formally an aromatic 10 electron system: the resonance between each pair of Kekule structures in this case involves only 4 electron pairs of the pentalene fragments and it averts the resonance with the additional fifth electron pair common for both the structures. Such a resonance is described as an anti-combination of two Kekule structures: (A–B), (C–B) and (C–A).
>
> Just need to add B3LYP/6-31G* calculations to RDKit's aromaticity perception algorithm and everything will be fine. :)
>
> The "characterized as an antiaromatic system[23]" is "T. K. Zywietz, H. Jiao, P. v. R. Schleyer and A. de Meijere, J. Org. Chem., 1998, 63, 3417" at https://pubs.acs.org/doi/abs/10.1021/jo980089f .
>
> Cheers,
>
> Andrew
> da...@dalkescientific.com
>
> [*]
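
The electron-counting argument in this message can be checked mechanically. The sketch below is only an illustration of the 4n+2 / 4n counting rules as they are used in the thread; classify_pi_count is a hypothetical helper, not an RDKit function, and the electron counts are the ones quoted in the messages above.

```python
def classify_pi_count(n_electrons: int) -> str:
    """Classify a pi-electron count by the 4n+2 / 4n counting rules."""
    if n_electrons <= 0:
        return "neither"
    if n_electrons % 4 == 2:
        return "aromatic (4n+2)"
    if n_electrons % 4 == 0:
        return "anti-aromatic (4n)"
    return "neither"


# Electron counts quoted in the thread; the labels are just dictionary keys.
counts = {
    "acepentalene, whole system": 10,
    "acepentalene, outer ring": 9,
    "acepentalene, single 5-membered ring": 5,
    "aceazulylene, outer ring": 10,
    "aceazulylene, whole system": 12,
}

for label, n in counts.items():
    print(f"{label}: {n} electrons -> {classify_pi_count(n)}")
```

The disagreement in the thread is about which count to feed into such a rule (whole system versus outer ring), not about the rule itself: for aceazulylene the two choices give opposite answers.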
[Rdkit-discuss] last call for mmpdb funding
Hi all,

This is the last email I'll send asking for people and organizations to join the current mmpdb crowdsourcing effort. I've discussed it several times before here.

In summary, I'm looking for crowdfunding for the matched molecular pair program 'mmpdb'. This is part of a test to find alternative ways to raise money for open source development in cheminformatics. See http://mmpdb.dalkescientific.com/ for details.

I'm pleased to say the effort has currently raised EUR 17 500. This is enough to fully finish off Postgres support and the new 'proprulecat', and to pay back the time taken to organize this funding effort. In addition, it passed the EUR 16 000 goal, which means that next month I'll change mmpdb so it stores the environment as a more easily interpretable fragment SMILES, rather than a hashed version of the circular environments.

The next funding goal is EUR 23 000, which is EUR 5 500 away. If I reach that funding goal, I will commit to making a public/no-cost release of the new code in Oct. 2020.

I've extended the deadline to join to 15 February because some additional marketing will go out at the end of this month, and I want to give recipients a chance to participate.

People can still purchase mmpdb after the deadline is reached. Those goals merely represent specific commitments from me to work on mmpdb.

For those interested in the budget, or who think that EUR 17 500 is already a large amount of money: EUR 17 500 is corporate income, not salary income. About 50% of that goes to payroll taxes. The average software developer salary here in Sweden is about EUR 50 000, so EUR 9 000 is about 2 months of time. (I would be making an above-average salary should I decide to go corporate.) I spent several weeks working on the web site, which is essential marketing for crowdsourcing, and paid a web designer to help. I've also spent a few weeks working on improvements already delivered to customers. Even invoicing takes time.
But the goal of this effort isn't for me to be rich, though that would be nice. It's to see if an open-source project like mmpdb can be economically self-sustaining through funding by users interested in paying for specific new features and support.

Best regards,

Andrew
da...@dalkescientific.com
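
The budget figures in this message can be worked through as a short calculation. This is only a sketch of the arithmetic using the numbers quoted above; the variable names are illustrative, and the 50% payroll-tax share and the EUR 50 000 average salary are the approximate values given in the message, not exact figures.

```python
raised_eur = 17_500            # amount raised so far
next_goal_eur = 23_000         # next funding goal
payroll_tax_share = 0.50       # "about 50%" goes to payroll taxes
average_salary_eur = 50_000    # approximate yearly developer salary quoted above

gap_eur = next_goal_eur - raised_eur
after_tax_eur = raised_eur * (1 - payroll_tax_share)
months_funded = after_tax_eur / (average_salary_eur / 12)

print(f"Gap to next goal: EUR {gap_eur}")
print(f"Left after payroll taxes: EUR {after_tax_eur:.0f}")
print(f"Equivalent developer time: {months_funded:.1f} months")
```

The result of roughly EUR 8 750 and about 2.1 months is consistent with the rounded "EUR 9 000 is about 2 months" in the message.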
Re: [Rdkit-discuss] acepentalene aromaticity perception
On Jan 22, 2020, at 14:12, Greg Landrum wrote:
> As an aside: it's not particularly relevant to this discussion, but I don't understand why the wikipedia page says that the compound is anti-aromatic. I think the standard definition of anti-aromaticity (agrees with the one linked to from the acepentalene page) requires the ring system to have 4n electrons. That definitely doesn't apply here to either the individual rings or the system as a whole. The system as a whole has 10 electrons (4n+2), the individual rings each have 5 (neither aromatic nor anti-aromatic), and the outer envelope has 9 (again, neither aromatic nor anti-aromatic).

Because I didn't know either, I looked into it.

I think that's because (to quote "Towards experimental determination of conical intersection properties: a twin state based comparison with bound excited states", Phys. Chem. Chem. Phys., 2011, 13, 11872–11877 [*])

> A Hückel MO analysis[21] leads to the conclusion that the ground state of the conjugated tricyclic acepentalene I is a triplet state. DFT calculations corrected this picture and showed a singlet global minimum distorted to C_s symmetry with alternated single and double bonds,[22] which are well described by the Lewis structures A(B,C). According to a B3LYP/6-31G* calculation the lowest triplet state has also a high symmetric C_3v configuration and lies 3.9 kcal/mol above the singlet ground state minimum. Acepentalene I was characterized as an antiaromatic system [23] despite being formally an aromatic 10 electron system: the resonance between each pair of Kekule structures in this case involves only 4 electron pairs of the pentalene fragments and it averts the resonance with the additional fifth electron pair common for both the structures. Such a resonance is described as an anti-combination of two Kekule structures: (A–B), (C–B) and (C–A).
Just need to add B3LYP/6-31G* calculations to RDKit's aromaticity perception algorithm and everything will be fine. :)

The "characterized as an antiaromatic system[23]" is "T. K. Zywietz, H. Jiao, P. v. R. Schleyer and A. de Meijere, J. Org. Chem., 1998, 63, 3417" at https://pubs.acs.org/doi/abs/10.1021/jo980089f .

Cheers,

Andrew
da...@dalkescientific.com

[*] https://www.researchgate.net/profile/Shmuel_Zilberg/publication/51175586_Towards_experimental_determination_of_conical_intersection_properties_A_twin_state_based_comparison_with_bound_excited_states/links/561bb5bc08ae6d17308b037f/Towards-experimental-determination-of-conical-intersection-properties-A-twin-state-based-comparison-with-bound-excited-states.pdf
Re: [Rdkit-discuss] acepentalene aromaticity perception
Hi,

For aromaticity, I believe a ring has to have 4n+2 electrons along its periphery. I would be curious to know what other SMILES generators make of this system.

-P.

On Wed, Jan 22, 2020 at 8:14 AM Greg Landrum wrote:

> Hi Andrew,
>
> There's a bug here.
>
> Here's what I believe is happening:
> The system as a whole has 10 pi electrons, so the RDKit perceives it as aromatic. But then the logic that is used to flag the fusing bond in azulene as single (instead of aromatic) prevents the bonds between the central atom and the outer ones from being flagged as aromatic. This is clearly wrong. Now we just need to figure out how to fix it. :-)
>
> As an aside: it's not particularly relevant to this discussion, but I don't understand why the wikipedia page says that the compound is anti-aromatic. I think the standard definition of anti-aromaticity (agrees with the one linked to from the acepentalene page) requires the ring system to have 4n electrons. That definitely doesn't apply here to either the individual rings or the system as a whole. The system as a whole has 10 electrons (4n+2), the individual rings each have 5 (neither aromatic nor anti-aromatic), and the outer envelope has 9 (again, neither aromatic nor anti-aromatic).
>
> Sorry for the super slow reply.
> -greg
>
> On Thu, Jan 9, 2020 at 9:56 PM Andrew Dalke wrote:
>
>> Hi all,
>>
>> Could someone explain the following, which uses the SMILES from https://en.wikipedia.org/wiki/Acepentalene :
>>
>> >>> from rdkit import Chem
>> >>> Chem.CanonSmiles("C1=CC2=CC=C3C2=C1C=C3")
>> 'c1cc2ccc3ccc1-c=3-2'
>> >>> import rdkit
>> >>> rdkit.__version__
>> '2019.09.1'
>>
>> I don't understand the aromatic "c" in the fused center of the 3 5-membered rings. It's connected by non-aromatic bonds to the rest of the system.
>>
>> This broke some code of mine which expects that every aromatic atom must have at least two aromatic bonds. I thought that all aromatic atoms had to be in aromatic rings, and that all aromatic rings had to have aromatic bonds.
>>
>> (I'm ignoring RDKit's support for aromatic triple bonds in this description.)
>>
>> I searched for "acepentalene" and "antiaromatic" in the issue tracker and the mailing list but found nothing relevant.
>>
>> Cheers,
>>
>> Andrew
>> da...@dalkescientific.com

--
-P.
Sent from a cell phone. Pls forgive brvty and m1$tea@ks.
Re: [Rdkit-discuss] acepentalene aromaticity perception
Hi Andrew,

There's a bug here.

Here's what I believe is happening:
The system as a whole has 10 pi electrons, so the RDKit perceives it as aromatic. But then the logic that is used to flag the fusing bond in azulene as single (instead of aromatic) prevents the bonds between the central atom and the outer ones from being flagged as aromatic. This is clearly wrong. Now we just need to figure out how to fix it. :-)

As an aside: it's not particularly relevant to this discussion, but I don't understand why the wikipedia page says that the compound is anti-aromatic. I think the standard definition of anti-aromaticity (agrees with the one linked to from the acepentalene page) requires the ring system to have 4n electrons. That definitely doesn't apply here to either the individual rings or the system as a whole. The system as a whole has 10 electrons (4n+2), the individual rings each have 5 (neither aromatic nor anti-aromatic), and the outer envelope has 9 (again, neither aromatic nor anti-aromatic).

Sorry for the super slow reply.
-greg

On Thu, Jan 9, 2020 at 9:56 PM Andrew Dalke wrote:

> Hi all,
>
> Could someone explain the following, which uses the SMILES from https://en.wikipedia.org/wiki/Acepentalene :
>
> >>> from rdkit import Chem
> >>> Chem.CanonSmiles("C1=CC2=CC=C3C2=C1C=C3")
> 'c1cc2ccc3ccc1-c=3-2'
> >>> import rdkit
> >>> rdkit.__version__
> '2019.09.1'
>
> I don't understand the aromatic "c" in the fused center of the 3 5-membered rings. It's connected by non-aromatic bonds to the rest of the system.
>
> This broke some code of mine which expects that every aromatic atom must have at least two aromatic bonds. I thought that all aromatic atoms had to be in aromatic rings, and that all aromatic rings had to have aromatic bonds.
>
> (I'm ignoring RDKit's support for aromatic triple bonds in this description.)
>
> I searched for "acepentalene" and "antiaromatic" in the issue tracker and the mailing list but found nothing relevant.
>
> Cheers,
>
> Andrew
> da...@dalkescientific.com
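
The invariant from the quoted message (every aromatic atom has at least two aromatic bonds) can be checked against the reported canonical SMILES without RDKit. The sketch below is an illustration only: the bond list is transcribed by hand from 'c1cc2ccc3ccc1-c=3-2', it assumes the usual SMILES convention that an unmarked bond between two lowercase atoms is aromatic, and atoms_violating_invariant is a hypothetical helper, not part of any library.

```python
# Atoms are numbered in the order they appear in 'c1cc2ccc3ccc1-c=3-2'.
AROMATIC_ATOMS = set(range(10))  # all ten atoms are written as lowercase 'c'

# (atom_i, atom_j, is_aromatic), transcribed by hand from the SMILES string.
BONDS = [
    (0, 1, True), (1, 2, True), (2, 3, True), (3, 4, True), (4, 5, True),
    (5, 6, True), (6, 7, True), (7, 8, True), (8, 0, True),  # outer 9-membered ring
    (8, 9, False),  # '-'  : explicit single bond to the central atom
    (9, 5, False),  # '=3' : explicit double bond closing ring 3
    (9, 2, False),  # '-2' : explicit single bond closing ring 2
]


def atoms_violating_invariant(aromatic_atoms, bonds, minimum=2):
    """Return aromatic atoms that have fewer than `minimum` aromatic bonds."""
    counts = {atom: 0 for atom in aromatic_atoms}
    for i, j, is_aromatic in bonds:
        if is_aromatic:
            for atom in (i, j):
                if atom in counts:
                    counts[atom] += 1
    return sorted(atom for atom, n in counts.items() if n < minimum)


print(atoms_violating_invariant(AROMATIC_ATOMS, BONDS))  # prints [9]
```

Only the central atom (index 9) fails the check: it is flagged aromatic but has no aromatic bonds at all, which matches the behavior described as a bug in this message.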