[Rdkit-discuss] rdkit-cartridge: Inserting new molecules

2020-10-25 Thread Thomas Strunz
Dear community,

I was wondering how to best insert molecules into a mol column. The 
documentation only shows how to build a mol column from a preexisting table with 
a SMILES column using "mol_from_smiles".

How can a molecule be inserted directly, i.e. what format needs to be submitted?

The second point is how to keep insertion simple so that not every application 
connecting to the DB needs to be chemistry-aware (i.e. have RDKit available). I have 
played around with having both a SMILES column and a mol column and using a 
trigger function to convert the SMILES to a mol, but this duplicates all 
the data (and is even more wasteful with CTAB files). Is it possible to avoid 
duplicating the data and still insert SMILES/CTAB directly?
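
For the direct-insert question above, a minimal sketch of what this can look like from a client that is not chemistry-aware, sending plain SQL through psycopg2 and using the cartridge's mol_from_smiles function (the table and column names are made up for illustration):

# Minimal sketch: insert a molecule directly via the cartridge's mol_from_smiles,
# so the client only needs to ship a SMILES string. Table/column names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=chembl user=rdkit")
with conn, conn.cursor() as cur:
    # the cast to cstring mirrors the cartridge documentation's usage
    cur.execute(
        "INSERT INTO compounds (name, m) "
        "VALUES (%s, mol_from_smiles(%s::cstring))",
        ("aspirin", "CC(=O)Oc1ccccc1C(=O)O"),
    )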

Best Regards,

Thomas
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Jupyter Notebook Structure image size

2020-09-28 Thread Thomas Strunz
Hi all,

how can I change the default structure image size in a Jupyter notebook for

- a single molecule?
- molecules in a PandasTools dataframe?

The default size is way too large for my taste (it takes up too much screen space); one possible approach is sketched below.
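
A minimal sketch of one way to adjust these defaults; IPythonConsole.molSize is the usual knob for single molecules, while the module-level PandasTools.molSize attribute is an assumption that may depend on the RDKit version:

# Sketch: shrink the default depiction size in a notebook.
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole

# single molecules rendered inline use this default size
IPythonConsole.molSize = (150, 100)

# for molecules rendered inside a pandas DataFrame; whether this attribute
# exists may depend on the RDKit version (assumption, not verified here)
PandasTools.molSize = (150, 100)

mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')
mol  # displayed at the new, smaller size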

Best Regards,

Thomas
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] proper technical term for generating virtual compounds with rdkit and smarts

2020-09-25 Thread Thomas Strunz
Hi Brian,

Commercial tools usually use the term "reaction-based enumeration" or 
"reaction-based library design".

Best Regards,

Thomas


From: Bennion, Brian via Rdkit-discuss
Sent: Friday, 25 September 2020 07:19
To: RDKit Discuss
Subject: [Rdkit-discuss] proper technical term for generating virtual compounds with rdkit and smarts

hello

I have a paper in review that is intended for a large audience of synthetic 
chemists, biologists, and computational chemists.
One reviewer had issues with the term "in-silico syntheses".
I used rdkit and SMARTS reactions to generate large libraries of compounds for 
our research project. Is there a better term to use? I feel "chemical 
enumeration" is just as foreign.

The abstract is below.


The current standard treatment for organophosphate poisoning primarily relies 
on the use of small molecule-based oximes that can efficiently restore 
acetylcholinesterase (AChE) activity.  Despite their efficacy in reactivating 
AChE, the action of drugs like 2-pralidoxime (2-PAM) is primarily limited to 
the peripheral nervous system (PNS) and, thus, provides no protection to the 
central nervous system (CNS).  This lack of action in the CNS stems from the 
ionic nature of the drugs; they cannot cross the blood-brain barrier (BBB) to 
access any nerve agent-inhibited AChE therein.  In this report, we present a 
small molecule oxime, called LLNL-02, that can diffuse across the BBB for 
reactivation of nerve agent-inhibited AChE in the CNS.  Our 
candidate-development approach utilizes a combination of parallel chemical and 
in-silico syntheses, computational modeling, and a battery of detailed in 
vitro and in vivo assessments that have identified LLNL-02 as a top CNS-active 
candidate against nerve agent poisoning. Additional experiments to determine 
acute and chronic toxicity, as required for regulatory approval, are ongoing.

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] TIL: Mol objects having varying attributes depending on rdkit imports

2020-09-23 Thread Thomas Strunz
Dear readers,

I just wasted an amazing amount of time on this and am reporting it in case someone 
else happens to run into the same problem.

If you simply do:

from rdkit import Chem
m = Chem.MolFromSmiles('c1ccccc1')
m.Compute2DCoords()

you will get an error:

Mol object has no attribute Compute2DCoords

Since I have done this many times before and it worked, it was very 
confusing.

It seems that to have this attribute available one also needs to do

from rdkit.Chem import AllChem

and then it works. This also applies to ComputeGasteigerCharges and maybe even 
more methods.
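
For reference, calling the functions at module level through AllChem avoids relying on this import side effect; a minimal sketch:

# Sketch: call the depiction / charge functions explicitly via AllChem,
# which works regardless of which submodules happen to be imported.
from rdkit import Chem
from rdkit.Chem import AllChem

m = Chem.MolFromSmiles('c1ccccc1')
AllChem.Compute2DCoords(m)           # generate 2D coordinates
AllChem.ComputeGasteigerCharges(m)   # annotate atoms with Gasteiger charges
print(m.GetAtomWithIdx(0).GetDoubleProp('_GasteigerCharge'))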

So if you have this problem, you will now hopefully find this information via 
search and be able to solve it much faster than I did.

best regards,

Thomas

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDkit in Python vs. on PostgreSQL?

2020-03-02 Thread Thomas Strunz
Hi Deepti,

for the Spark part I would say you simply generate all the fingerprints (locally or 
on the cluster) and store the resulting list of fingerprints as a pickle file. Then, 
when running your test, you simply load the pickle file into memory. With 15 GB of 
memory and 2 million molecules this should easily work out fine, for a test 
obviously.
I have a simple web app that does exactly this, albeit with only about 200k 
molecules, using 400 MB of RAM, most of which I assume comes from the 
fingerprints. That would mean 2 million fingerprints would only use about 4 GB 
of RAM.

Still, this raises the question of what it would be used for, as this 
approach obviously doesn't scale at all and you would need some way of storing the 
fingerprints on Spark as well. Also, if your goal is to do similarity searches over 
lots of fingerprints, I suggest you have a look at chemfp.
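
A minimal sketch of the precompute-and-pickle approach described above (fingerprint type, file name, and data are illustrative; RDKit bit vectors pickle fine):

# Sketch: precompute fingerprints once, pickle them, and reload them for searching.
import pickle
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = ['CCO', 'c1ccccc1', 'CC(=O)Oc1ccccc1C(=O)O']   # stand-in for the real data
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

with open('fps.pkl', 'wb') as fh:
    pickle.dump(fps, fh)

# later / in the web app: load once, keep in memory
with open('fps.pkl', 'rb') as fh:
    fps = pickle.load(fh)

query = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles('CCO'), 2, nBits=2048)
scores = DataStructs.BulkTanimotoSimilarity(query, fps)
print(scores)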

Best Regards,

Thomas


From: Deepti Gupta via Rdkit-discuss
Sent: Wednesday, 26 February 2020 09:46
To: rdkit-discuss@lists.sourceforge.net; Tim Dudgeon
Subject: Re: [Rdkit-discuss] RDkit in Python vs. on PostgreSQL?

Hi Tim,

Thank you!

I'll be more detailed in my post, sorry about that. As this was a PoC, I had a 
Spark cluster on Google Cloud with 2 worker nodes, each with 4 vCPUs, a 500 GB disk, 
and 15 GB of memory. I timed the response against 2 million data points 
consisting of ChEMBL IDs and SMILES structures.

Substructure search - 2 mins
Similarity search - 43 mins

The PostgreSQL DB was installed on a VM with 4 vCPUs, a 500 GB disk, and 15 GB of 
memory. The value shared_buffers = 2048MB was set in the postgresql.conf file.

Substructure search - within 5 secs
Similarity search - within 3 secs

I tried to store the converted molecules and fingerprints in a file to get 
better performance in the PySpark program, but was not able to do so.

Regards,
DA

On Wednesday, February 26, 2020, 12:57:43 AM GMT+5:30, Tim Dudgeon wrote:



I think you need to explain what benchmarks you are running and what is really 
meant by "faster",
and what hardware is involved (for Spark: how many nodes and how big; for PostgreSQL: 
what size of server and what settings, especially the shared_buffers setting).

A very obvious critique of what you reported is that what you describe as 
"running in Python" includes generating the fingerprints for each molecule on 
the fly, whereas for "the cartridge" these are already calculated, so it will 
obviously be much faster (the fingerprint generation dominates the compute).
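
To make that point concrete, a minimal sketch that times fingerprint generation separately from the actual comparison (data set and fingerprint choice are illustrative; in the cartridge case the fingerprints already sit in an indexed column):

# Sketch: fingerprint generation vs. similarity scoring, timed separately.
import time
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = ['CCO', 'c1ccccc1', 'CC(=O)Oc1ccccc1C(=O)O'] * 10000   # stand-in data set

t0 = time.time()
fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
       for s in smiles]
print('fingerprint generation: %.1f s' % (time.time() - t0))     # dominates

query = AllChem.GetMorganFingerprintAsBitVect(
    Chem.MolFromSmiles('CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'), 2, nBits=2048)

t0 = time.time()
scores = DataStructs.BulkTanimotoSimilarity(query, fps)          # fast once fps exist
print('similarity scoring:     %.1f s' % (time.time() - t0))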

Tim

On 25/02/2020 11:14, Deepti Gupta via Rdkit-discuss wrote:
Hi Gurus,

I'm absolutely new to the cheminformatics domain. I've been assigned a PoC where 
I have to compare RDKit in Python and RDKit on PostgreSQL. I've installed both 
and am trying some hands-on exercises to understand the differences. What I've 
understood is that structure searches are slower in Python (Spark cluster) 
than in the PostgreSQL database. Please correct me if I'm wrong, as I'm a newbie in 
this and may be saying something silly.

The similarity search using the functions below (example):

Python methods -

fps = FingerprintMols.FingerprintMol(Chem.MolFromSmiles(smile_structure, sanitize=False))
similarity = DataStructs.TanimotoSimilarity(fps1, fps2)

takes too long (45 minutes) for a 2-million-row file, while the same thing is very 
quick (seconds) in PostgreSQL.
Database functions -

select count(*) from (
    select modality_id, m,
           tanimoto_sml(morganbv_fp(mol_from_smiles('CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'::cstring)), mfp2) as similarity
    from fingerprints
    join mols using (modality_id)
) as fps
where similarity between 0.45 and 0.50;

Does this mean that for production workloads one must always use a database 
cartridge (like RDKit, Bingo, etc.)?

Regards,
DA




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDKit Cartridge mol_to_svg parameters

2020-02-13 Thread Thomas Strunz
OK, I dug into the source code
(https://github.com/rdkit/rdkit/blob/d41752d558bf7200ab67b98cdd9e37f1bdd378de/Code/GraphMol/MolDraw2D/MolDraw2DUtils.cpp)
and it seems bondLineWidth simply can't be set:


void updateDrawerParamsFromJSON(MolDraw2D &drawer, const std::string &json) {
  if (json == "") {
    return;
  }
  std::istringstream ss;
  ss.str(json);
  MolDrawOptions &opts = drawer.drawOptions();
  boost::property_tree::ptree pt;
  boost::property_tree::read_json(ss, pt);
  PT_OPT_GET(atomLabelDeuteriumTritium);
  PT_OPT_GET(dummiesAreAttachments);
  PT_OPT_GET(circleAtoms);
  PT_OPT_GET(continuousHighlight);
  PT_OPT_GET(flagCloseContactsDist);
  PT_OPT_GET(includeAtomTags);
  PT_OPT_GET(clearBackground);
  PT_OPT_GET(legendFontSize);
  PT_OPT_GET(multipleBondOffset);
  PT_OPT_GET(padding);
  PT_OPT_GET(additionalAtomLabelPadding);
  get_colour_option(&pt, "highlightColour", opts.highlightColour);
  get_colour_option(&pt, "backgroundColour", opts.backgroundColour);
  get_colour_option(&pt, "legendColour", opts.legendColour);
  if (pt.find("atomLabels") != pt.not_found()) {
    BOOST_FOREACH (boost::property_tree::ptree::value_type const &item,
                   pt.get_child("atomLabels")) {
      opts.atomLabels[boost::lexical_cast<int>(item.first)] =
          item.second.get_value<std::string>();
    }
  }
}

legendFontSize indeed works. I was using a setting carried over from Python (0.5), 
which got silently ignored, but with a proper value in points I assume it works.

So I suggest adding bondLineWidth as an option to the above method.
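
For comparison, from Python the option can be set directly on the drawer; a minimal sketch (this is the MolDraw2D-level setting, not the cartridge JSON):

# Sketch: set bondLineWidth on a Python-side drawer, which is what the cartridge
# JSON currently cannot do.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import rdMolDraw2D

mol = Chem.MolFromSmiles('c1ccccc1C(=O)O')
AllChem.Compute2DCoords(mol)

drawer = rdMolDraw2D.MolDraw2DSVG(150, 100)
drawer.drawOptions().bondLineWidth = 1   # thinner bonds
drawer.DrawMolecule(mol)
drawer.FinishDrawing()
svg = drawer.GetDrawingText()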

Best Regards,

Thomas


From: Thomas Strunz
Sent: Thursday, 13 February 2020 09:09
To: rdkit-discuss@lists.sourceforge.net
Subject: [Rdkit-discuss] RDKit Cartridge mol_to_svg parameters

Hi All,

I started to play around with the RDKit cartridge and was wondering how to 
use the mol_to_svg function correctly.

On the RDKit homepage I only found this:

mol_to_svg(mol,string default ‘’,int default 250, int default 200, string 
default ‘’) : returns an SVG with a drawing
of the molecule. The optional parameters are a string to use as the legend, the 
width of the image, the height of the image,
and a JSON with additional rendering parameters. (available from the 2016_09 
release)

The interesting part is the rendering parameters. Is there a list of them and 
are there some examples of this function?

I tried something like the following:

mol_to_svg(mol, 'Test', 150, 100, '{"bondLineWidth": 1, "legendFontSize": 0.5}')

There is no error, but the JSON options are not applied. The image always 
looks the same, with bonds that are too thick for my taste.

Best Regards,

Thomas
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] RDKit Cartridge mol_to_svg parameters

2020-02-13 Thread Thomas Strunz
Hi All,

I started to play around with the RDKit cartridge and was wondering how to 
use the mol_to_svg function correctly.

On the RDKit homepage I only found this:

mol_to_svg(mol,string default ‘’,int default 250, int default 200, string 
default ‘’) : returns an SVG with a drawing
of the molecule. The optional parameters are a string to use as the legend, the 
width of the image, the height of the image,
and a JSON with additional rendering parameters. (available from the 2016_09 
release)

The interesting part is the rendering parameters. Is there a list of them and 
are there some examples of this function?

I tried something like the following:

mol_to_svg(mol, 'Test', 150, 100, '{"bondLineWidth": 1, "legendFontSize": 0.5}')

There is no error, but the JSON options are not applied. The image always 
looks the same, with bonds that are too thick for my taste.

Best Regards,

Thomas
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] How does rdkit cartridge work?

2020-01-26 Thread Thomas Strunz
Hi Greg,

about your comment on custom fingerprints:

"I'm not sure what you mean by substructures, but you can, from Python, use 
custom fingerprints in the cartridge. That's explained here: 
http://rdkit.blogspot.com/2017/04/using-custom-fingerprint-in-postgresql.html;


For me, and I suspect this is what Changge was actually asking, the question is 
whether one can add a custom fingerprint for the substructure-search part (the 
screen-out) and not just for similarity search (the linked blog post looks like it 
only covers similarity search). By "custom substructure" fingerprint I would assume 
a fully or partially custom substructure fingerprint. In some cases the pattern 
fingerprint doesn't have very good screen-out rates and the layered fingerprint is 
much better (not really an issue for me, just an observation). So from that point 
of view I can see and understand the need for using a custom fingerprint for the 
screen-out step.
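
A minimal sketch of how such a screen-out comparison can be measured from Python (the query and molecule set are illustrative; PatternFingerprint, LayeredFingerprint, and AllProbeBitsMatch are used the same way elsewhere in this archive):

# Sketch: compare how well two fingerprints screen out non-matching molecules
# for a substructure query. A lower pass rate means better screen-out.
from rdkit import Chem, DataStructs

mols = [Chem.MolFromSmiles(s) for s in ('CCO', 'c1ccccc1O', 'c1ccc2[nH]ccc2c1', 'CC(=O)N')]
query = Chem.MolFromSmiles('c1ccccc1')

def pass_rate(fp_func):
    qfp = fp_func(query)
    passed = sum(DataStructs.AllProbeBitsMatch(qfp, fp_func(m)) for m in mols)
    return passed / len(mols)

print('pattern fingerprint pass rate:', pass_rate(lambda m: Chem.PatternFingerprint(m, 1024)))
print('layered fingerprint pass rate:', pass_rate(lambda m: Chem.LayeredFingerprint(m)))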

Best Regards,

Thomas

From: Greg Landrum
Sent: Saturday, 25 January 2020 06:11
To: Chicago Ji
Cc: rdkit-discuss
Subject: Re: [Rdkit-discuss] How does rdkit cartridge work?

Hi Changge,

On Fri, Jan 24, 2020 at 5:14 PM Chicago Ji <chicago...@gmail.com> wrote:

I find that rdkit cartridge is quite efficient in substructure searching.

Glad to hear that! :-)

Is there any paper or similar material that describes what is behind the rdkit 
cartridge?

No, just the documentation.

For example, what kind of substructures were indexed?

The Pattern fingerprint is what's used to build the index for substructure 
searching. That fingerprint is described in the RDKit documentation here: 
https://www.rdkit.org/docs/RDKit_Book.html#pattern-fingerprints

Is there a way that the users can add custom defined fingerprints and 
substructures?

I'm not sure what you mean by substructures, but you can, from Python, use 
custom fingerprints in the cartridge. That's explained here: 
http://rdkit.blogspot.com/2017/04/using-custom-fingerprint-in-postgresql.html
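
A heavily hedged sketch of what loading a custom fingerprint into the cartridge from Python might look like, along the lines of that blog post; the bfp_from_binary_text function and the table layout are assumptions here and not verified against a specific cartridge version:

# Sketch (assumptions flagged): push a custom Python-side fingerprint into a bfp column.
import psycopg2
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('c1ccccc1O')
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=1024)   # "custom" fingerprint

conn = psycopg2.connect("dbname=chembl user=rdkit")
with conn, conn.cursor() as cur:
    # bfp_from_binary_text(...) is assumed to exist in the cartridge, as in the blog post
    cur.execute(
        "INSERT INTO custom_fps (molregno, cfp) VALUES (%s, bfp_from_binary_text(%s))",
        (1, psycopg2.Binary(DataStructs.BitVectToBinaryText(fp))),
    )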

Best,
-greg


Many thanks for your help!

Best,
Changge



___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Observations about RDKit performance: PatternFingerprinter, Windows, Linux and Virtual machines

2020-01-24 Thread Thomas Strunz
Hi Jan,

yeah, this numpy test might not be ideal since it depends a lot on how numpy was 
built, namely which underlying BLAS library was used. I suspect your Debian 
VM doesn't use OpenBLAS or MKL.
You can check that with
You can check that with

np.__config__.show()

For n = 6000, native (Windows 10) takes 2 s and the VM (Lubuntu 16.04) takes 2.5 s. 
While the difference again is roughly 50%, I'm hesitant, as here we are really 
not comparing apples to apples: Windows uses Intel MKL vs. OpenBLAS on Linux.

Best Regards,

Thomas


From: Jan Holst Jensen
Sent: Friday, 24 January 2020 13:43
To: Thomas Strunz; Maciek Wójcikowski
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] Observations about RDKit performance: PatternFingerprinter, Windows, Linux and Virtual machines

Hi Thomas,

FWIW I ran your example code below on my VM host (CentOS 7.3, Intel(R) Xeon(R) 
CPU E3-1245 v6 @ 3.70GHz) and in a Linux VM (Debian 9).

n = 6000:  Host = 3.8 secs   VM = 145 secs   (~40 times slower)
n = 1000:  Host = 0.03 secs  VM = 0.6 secs   (~20 times slower)

So based on these timings, your 50% penalty in the VM sounds really good :-).

Now, the example maxes out all available cores on the host, but sticks to a 
single core in the VM. I don't know the reason for that, but perhaps differing 
build options for numpy on CentOS vs. Debian? That explains roughly a factor of 8 
for me (the CPU has 4 cores, 8 threads). Still, after correcting for active core 
count, the VM ends up taking 2 - 5 times as long as the host.

For other workloads I generally don't see such a dramatic difference; more like 
10-30% slower performance in VMs compared to native. Seems like you have hit a 
particular VM weak spot with your workload.

If container deployment is an alternative option instead of a VM, perhaps that 
would improve matters? Of course, that won't help you if you need to deploy on 
Windows.

Cheers
-- Jan

On 2020-01-24 09:30, Thomas Strunz wrote:
Hi Maciek,

yeah, I thought this could be the issue as well, but according to the tools 
(grep flags /proc/cpuinfo | uniq, or coreinfo on Windows) the VMs also support 
SSE4.2 (and lower) and AVX.

In fact, I seem to have to look further, as I noticed that in general Python 
performance (and possibly more, not tested) is much slower on the VMs. See the 
code below, which is actually a way to see the performance impact of vector 
extensions and especially of Intel MKL.

import numpy as np
import time

n = 2
A = np.random.randn(n,n).astype('float64')
B = np.random.randn(n,n).astype('float64')

start_time = time.time()
nrm = np.linalg.norm(A@B)
print(" took {} seconds ".format(time.time() - start_time))
print(" norm = ",nrm)


The last code fragment runs about 50% slower on the Windows VM compared to my 
laptop, accounting for clock and core-count differences. It's confusing to me, as 
the performance difference is so consistent and apparent, but I would assume that 
if this were normal, people would have noticed a long time ago? Yet I can't find 
anything about it. Or does everyone run their code natively?

Best Regards,

Thomas


From: Maciek Wójcikowski <mac...@wojcikowski.pl>
Sent: Thursday, 23 January 2020 11:04
To: Thomas Strunz <beginn...@hotmail.de>
Cc: Greg Landrum <greg.land...@gmail.com>; rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] Observations about RDKit performance: PatternFingerprinter, Windows, Linux and Virtual machines

Thomas,

Could you double-check whether your VM has the same set of instructions as your 
host? Hardware popcounts, which are used to accelerate fingerprint 
operations, can have a profound impact on performance. SSE4.2 is probably 
the one that is used in the RDKit (at least this is stated in the code).

For KVM https://www.linux-kvm.org/page/Tuning_KVM (there are linux commands to 
check what is available on guest, so might be helpful for you too).
It also seems that in VMWare world this might be tricky, as it is considered to 
be a stability hazard: 
https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.vcenterhost.doc_50%2FGUID-8B226625-4923-410C-B7AF-51BCD2806A3B.html

Best,
Maciek


Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl


On Thu, 23 Jan 2020 at 08:15, Thomas Strunz <beginn...@hotmail.de> wrote:
Hi Greg,

reopening this old question. I can see that there are potential differences 
between RDKit versions and especially between Linux and Windows, but let's leave 
that aside for now.

After further "playing around", however, I really have the impression there is a 
real issue with running RDKit (or Python?) in a virtualized operating system. 
Since most production software and/or when using the cloud will mostly run in a

Re: [Rdkit-discuss] Observations about RDKit performance: PatternFingerprinter, Windows, Linux and Virtual machines

2020-01-24 Thread Thomas Strunz
Hi Maciek,

yeah, I thought this could be the issue as well, but according to the tools 
(grep flags /proc/cpuinfo | uniq, or coreinfo on Windows) the VMs also support 
SSE4.2 (and lower) and AVX.

In fact, I seem to have to look further, as I noticed that in general Python 
performance (and possibly more, not tested) is much slower on the VMs. See the 
code below, which is actually a way to see the performance impact of vector 
extensions and especially of Intel MKL.

import numpy as np
import time

n = 2
A = np.random.randn(n,n).astype('float64')
B = np.random.randn(n,n).astype('float64')

start_time = time.time()
nrm = np.linalg.norm(A@B)
print(" took {} seconds ".format(time.time() - start_time))
print(" norm = ",nrm)


The last code fragment runs about 50% slower on the Windows VM compared to my 
laptop, accounting for clock and core-count differences. It's confusing to me, as 
the performance difference is so consistent and apparent, but I would assume that 
if this were normal, people would have noticed a long time ago? Yet I can't find 
anything about it. Or does everyone run their code natively?

Best Regards,

Thomas


From: Maciek Wójcikowski
Sent: Thursday, 23 January 2020 11:04
To: Thomas Strunz
Cc: Greg Landrum; rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] Observations about RDKit performance: PatternFingerprinter, Windows, Linux and Virtual machines

Thomas,

Could you double-check whether your VM has the same set of instructions as your 
host? Hardware popcounts, which are used to accelerate fingerprint 
operations, can have a profound impact on performance. SSE4.2 is probably 
the one that is used in the RDKit (at least this is stated in the code).

For KVM https://www.linux-kvm.org/page/Tuning_KVM (there are linux commands to 
check what is available on guest, so might be helpful for you too).
It also seems that in VMWare world this might be tricky, as it is considered to 
be a stability hazard: 
https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.vcenterhost.doc_50%2FGUID-8B226625-4923-410C-B7AF-51BCD2806A3B.html

Best,
Maciek


Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl


On Thu, 23 Jan 2020 at 08:15, Thomas Strunz <beginn...@hotmail.de> wrote:
Hi Greg,

reopening this old question. I can see that there are potential differences 
between RDKit versions and especially between Linux and Windows, but let's leave 
that aside for now.

After further "playing around" however I really have the impression there is a 
real issue with running rdkit (or python?) in a virtualized operating sytem. 
Since most production software and/or when using the cloud will mostly run in a 
virtualized operating system, I think this should be a fairly relevant topic 
worth investigation. As you showed yourself, the AWS System also was fairly 
slow.

For the following observations I'm keeping the same dataset as before, which is 
from your blog post (/Regress/Scripts/fingerprint_screenout.py). Basically 
it's that code, slightly adapted:
mols = []
with gzip.open(data_dir + 'chembl21_25K.pairs.txt.gz', 'rb') as inf:
    for line in inf:
        line = line.decode().strip().split()
        smi1 = line[1]
        smi2 = line[3]
        m1 = Chem.MolFromSmiles(smi1)
        m2 = Chem.MolFromSmiles(smi2)
        mols.append(m1)
        mols.append(m2)

frags = [Chem.MolFromSmiles(x.split()[0]) for x in open(data_dir + 'zinc.frags.500.q.smi', 'r')]

mfps = [Chem.PatternFingerprint(m, 512) for m in mols]
fragsfps = [Chem.PatternFingerprint(m, 512) for m in frags]

%%timeit -n1 -r1
for i, fragfp in enumerate(fragsfps):
    hits = 0
    for j, mfp in enumerate(mfps):
        if DataStructs.AllProbeBitsMatch(fragfp, mfp):
            if mols[j].HasSubstructMatch(frags[i]):
                hits = hits + 1


I want to focus on the last cell, namely the "AllProbeBitsMatch" method:

%%timeit
DataStructs.AllProbeBitsMatch(fragsfps[10], mfps[10])

Results:

Windows 10 native, i7-8850H:                          567 ns ± 5.48 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
Lubuntu 16.04 virtualized, i7-8850H:                  1.81 µs ± 56.7 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)  // the high variation is consistent
Windows Server 2012 R2 virtualized, Xeon E5-2620 v4:  1.18 µs ± 4.09 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

So it seems virtualization reduces the performance of this specific method by 
half, which is also what I see when running the full substructure-search code, 
which takes double the time on the virtualized machines. (The Windows server 
actually runs on ESX, i.e. a type 1 hypervisor, while the Lubuntu VM is type 2 
(VMware Workstation), but both seem to suffer the same.)

We can try the same thing with

%%timeit
mols[10].HasSubstructMatch(frags[10])

Re: [Rdkit-discuss] Observations about RDKit performance: PatternFingerprinter, Windows, Linux and Virtual machines

2020-01-22 Thread Thomas Strunz
Hi Greg,

reopening this old question. I can see that there are potential differences 
between RDKit versions and especially between Linux and Windows, but let's leave 
that aside for now.

After further "playing around" however I really have the impression there is a 
real issue with running rdkit (or python?) in a virtualized operating sytem. 
Since most production software and/or when using the cloud will mostly run in a 
virtualized operating system, I think this should be a fairly relevant topic 
worth investigation. As you showed yourself, the AWS System also was fairly 
slow.

For the following observations I'm keeping the same dataset as before, which is 
from your blog post (/Regress/Scripts/fingerprint_screenout.py). Basically 
it's that code, slightly adapted:
mols = []
with gzip.open(data_dir + 'chembl21_25K.pairs.txt.gz', 'rb') as inf:
    for line in inf:
        line = line.decode().strip().split()
        smi1 = line[1]
        smi2 = line[3]
        m1 = Chem.MolFromSmiles(smi1)
        m2 = Chem.MolFromSmiles(smi2)
        mols.append(m1)
        mols.append(m2)

frags = [Chem.MolFromSmiles(x.split()[0]) for x in open(data_dir + 'zinc.frags.500.q.smi', 'r')]

mfps = [Chem.PatternFingerprint(m, 512) for m in mols]
fragsfps = [Chem.PatternFingerprint(m, 512) for m in frags]

%%timeit -n1 -r1
for i, fragfp in enumerate(fragsfps):
    hits = 0
    for j, mfp in enumerate(mfps):
        if DataStructs.AllProbeBitsMatch(fragfp, mfp):
            if mols[j].HasSubstructMatch(frags[i]):
                hits = hits + 1


I want to focus on the last cell, namely the "AllProbeBitsMatch" method:

%%timeit
DataStructs.AllProbeBitsMatch(fragsfps[10], mfps[10])

Results:

Windows 10 native, i7-8850H:                          567 ns ± 5.48 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
Lubuntu 16.04 virtualized, i7-8850H:                  1.81 µs ± 56.7 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)  // the high variation is consistent
Windows Server 2012 R2 virtualized, Xeon E5-2620 v4:  1.18 µs ± 4.09 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

So it seems virtualization reduces the performance of this specific method by 
half, which is also what I see when running the full substructure-search code, 
which takes double the time on the virtualized machines. (The Windows server 
actually runs on ESX, i.e. a type 1 hypervisor, while the Lubuntu VM is type 2 
(VMware Workstation), but both seem to suffer the same.)

We can try the same thing with

%%timeit
mols[10].HasSubstructMatch(frags[10])

The difference here is smaller, but the VMs also take >50% more time.

So there seems to be a consistent, large performance impact in VMs.

Of course a VM will be a bit slower, but not by that much? What am I missing? 
Any other experiences?

Best Regards,

Thomas

From: Greg Landrum
Sent: Monday, 16 December 2019 17:10
To: Thomas Strunz
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] Observations about RDKit performance: PatternFingerprinter, Windows, Linux and Virtual machines

Hi Thomas,

First it is important to compare equivalent major versions to each other. 
Particularly in this case. On my linux box generating the pattern fingerprints 
takes 24.2 seconds with v2019.03.x and 15.9 seconds with v2019.09.x (that's due 
to the improvements in the substructure matcher that the blog post you link to 
discusses).

Comparing the same versions to each other:

Performance on Windows vs. Linux
Windows performance with the RDKit has always lagged behind Linux performance. 
There's something in the code (or in the way we use the compiler) that leads to 
big differences on some benchmarks. The most straightforward way I can 
demonstrate this is with results from my Windows 10 laptop.
Here's the output when running the fingerprint_screenout.py benchmark using the 
windows build:
| 2019.09.1 | 13.6 | 0.3 | 38.1 | 0.8 | 25.5 | 25.9 | 84.1 |
and here's the output from a Linux build running on the Windows Subsystem for Linux:
| 2019.09.2 | 10.7 | 0.2 | 19.3 | 0.4 | 19.4 | 19.2 | 53.2 |
You can see the differences are not small.
I haven't invested massive time into it, but I haven't been able to figure out 
what causes this.

Performance on (linux) VMs
I can't think of any particular reason why there should be huge differences and 
it's really difficult to compare apples to apples here.
Since I have the numbers, here's one comparison

Here's a run on my linux workstation:
| 2019.09.2 | 7.6 | 0.3 | 15.9 | 0.4 | 21.4 | 20.4 | 55.7 |
and here's the same thing on an AWS t3.xlarge instance:
| 2019.09.2 | 9.6 | 0.2 | 20.3 | 0.4 | 38.4 | 38.2 | 94.7 |
The VM is significantly slower, but t3.xlarge isn't an instance type that's intended 
to be used for compute-intensive jobs (I don't have one of those active and 
configured at the moment).

Does that help at all?
-greg


On Mon, Dec 16, 2019 a

[Rdkit-discuss] Observations about RDKit performance: PatternFingerprinter, Windows, Linux and Virtual machines

2019-12-15 Thread Thomas Strunz
Hi All,

I was looking at a blog post from greg:

https://rdkit.blogspot.com/2019/07/a-couple-of-substructure-search-topics.html

about fingerprint screen-out. The part that got me confused was the timings in 
his blog post, because the run times in my case were a lot slower.

Greg's numbers:


[07:21:19] INFO: mols from smiles
[07:21:27] INFO: Results1:  7.77 seconds, 5 mols
[07:21:27] INFO: queries from smiles
[07:21:27] INFO: Results2:  0.16 seconds
[07:21:27] INFO: generating pattern fingerprints for mols
[07:21:43] INFO: Results3:  16.11 seconds
[07:21:43] INFO: generating pattern fingerprints for queries
[07:21:43] INFO: Results4:  0.34 seconds
[07:21:43] INFO: testing frags queries
[07:22:03] INFO: Results5:  19.90 seconds. 6753 tested (0.0003 of total), 3989 
found,  0.59 accuracy. 0 errors.
[07:22:03] INFO: testing leads queries
[07:22:23] INFO: Results6:  19.77 seconds. 1586 tested (0.0001 of total), 1067 
found,  0.67 accuracy. 0 errors.
[07:22:23] INFO: testing pieces queries
[07:23:19] INFO: Results7:  55.37 seconds. 202 tested (0.0810 of total), 
1925628 found,  0.58 accuracy. 0 errors.

| 2019.09.1dev1 | 7.8 | 0.2 | 16.1 | 0.3 | 19.9 | 19.8 | 55.4 |



Machine 1:
Virtual machine, Windows Server 2012 R2 with an intel xeon (4 virtual cores)

Since the test is single-threaded it makes some sense that it isn't fast here, 
but it's not just a bit slower; depending on the test it is almost 3 times slower.

[09:03:19] INFO: mols from smiles
[09:03:38] INFO: Results1:  19.44 seconds, 5 mols
[09:03:38] INFO: queries from smiles
[09:03:38] INFO: Results2:  0.36 seconds
[09:03:38] INFO: generating pattern fingerprints for mols
[09:04:54] INFO: Results3:  75.99 seconds
[09:04:54] INFO: generating pattern fingerprints for queries
[09:04:56] INFO: Results4:  1.55 seconds
[09:04:56] INFO: testing frags queries
[09:05:34] INFO: Results5:  37.59 seconds. 6753 tested (0.0003 of total), 3989 found,  0.59 accuracy. 0 errors.
[09:05:34] INFO: testing leads queries
[09:06:11] INFO: Results6:  37.34 seconds. 1586 tested (0.0001 of total), 1067 found,  0.67 accuracy. 0 errors.
[09:06:11] INFO: testing pieces queries
[09:08:39] INFO: Results7:  147.79 seconds. 202 tested (0.0810 of total), 1925628 found,  0.58 accuracy. 0 errors.
| 2019.03.3 | 19.4 | 0.4 | 76.0 | 1.5 | 37.6 | 37.3 | 147.8 |

I thought this might be another case of Windows being slow, so I tested on a Linux 
VM on my laptop.

Machine 2:
Virtual machine, Lubuntu 16.04 on a laptop i7-8850H 6-core

[09:23:31] INFO: mols from smiles
[09:23:54] INFO: Results1:  23.71 seconds, 5 mols
[09:23:54] INFO: queries from smiles
[09:23:55] INFO: Results2:  0.48 seconds
[09:23:55] INFO: generating pattern fingerprints for mols
[09:24:53] INFO: Results3:  58.31 seconds
[09:24:53] INFO: generating pattern fingerprints for queries
[09:24:54] INFO: Results4:  1.19 seconds
[09:24:54] INFO: testing frags queries
[09:25:41] INFO: Results5:  46.22 seconds. 6753 tested (0.0003 of total), 3989 
found,  0.59 accuracy. 0 errors.
[09:25:41] INFO: testing leads queries
[09:26:26] INFO: Results6:  45.84 seconds. 1586 tested (0.0001 of total), 1067 
found,  0.67 accuracy. 0 errors.
[09:26:26] INFO: testing pieces queries
[09:28:33] INFO: Results7:  126.78 seconds. 202 tested (0.0810 of total), 
1925628 found,  0.58 accuracy. 0 errors.
| 2019.03.3 | 23.7 | 0.5 | 58.3 | 1.2 | 46.2 | 45.8 | 126.8 |

Pretty weird: sometimes even slower, sometimes faster than the Windows VM, but 
still a lot slower than Greg's numbers (I repeated this with RDKit 2019.09.2 and got 
comparable results).

So I also tested on the above laptop directly:

Machine 3:
physical install, windows 10 on a laptop i7-8850H 6-core (same machine as 2)

[09:51:43] INFO: mols from smiles
[09:51:54] INFO: Results1:  10.59 seconds, 5 mols
[09:51:54] INFO: queries from smiles
[09:51:54] INFO: Results2:  0.20 seconds
[09:51:54] INFO: generating pattern fingerprints for mols
[09:52:24] INFO: Results3:  29.50 seconds
[09:52:24] INFO: generating pattern fingerprints for queries
[09:52:24] INFO: Results4:  0.61 seconds
[09:52:24] INFO: testing frags queries
[09:52:44] INFO: Results5:  19.71 seconds. 6753 tested (0.0003 of total), 3989 
found,  0.59 accuracy. 0 errors.
[09:52:44] INFO: testing leads queries
[09:53:04] INFO: Results6:  19.48 seconds. 1586 tested (0.0001 of total), 1067 
found,  0.67 accuracy. 0 errors.
[09:53:04] INFO: testing pieces queries
[09:54:05] INFO: Results7:  61.94 seconds. 202 tested (0.0810 of total), 
1925628 found,  0.58 accuracy. 0 errors.
| 2019.09.1 | 10.6 | 0.2 | 29.5 | 0.6 | 19.7 | 19.5 | 61.9 |

This is much closer to Greg's results, except for the fingerprinting, which takes 
almost double the time. Also notice how the fingerprinting on the Linux VM is 
much faster, also compared to the other results, than on the Windows VM.

Conclusions:

  1.  From what I see, it seems that the pattern fingerprinter runs a lot 
slower on Windows. Is this a known issue?
  2.  In virtual machines 

Re: [Rdkit-discuss] Anaconda installation without hard dependency on Intel MKl (windows)

2019-11-19 Thread Thomas Strunz
Hi all,

In the last couple of days there has been increased focus on this (MKL crippling 
Ryzen) on certain tech/social media sites; for example, MATLAB is also affected. 
Some of you might have seen it, but there seems to be a very simple workaround to 
get MKL to run properly on AMD Ryzen.

One simply needs to create a system environment variable

MKL_DEBUG_CPU_TYPE=5

and then anything using MKL will use the AVX2 code path (if applicable) and run 
much faster, faster even than with OpenBLAS.

Again, no extensive testing done, but this would in my opinion be the simplest 
workaround.
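
For completeness, a minimal sketch of setting the variable per process from Python; the assumption is that it must be set before numpy (and therefore MKL) is loaded:

# Sketch: enable the MKL AVX2 code path on AMD CPUs for this process only.
# Assumption: the variable has to be set before MKL is loaded, i.e. before importing numpy.
import os
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

import numpy as np
np.__config__.show()   # confirm which BLAS/LAPACK numpy was built against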

Best Regards,

Thomas

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Anaconda installation without hard dependency on Intel MKl (windows)

2019-11-12 Thread Thomas Strunz
Hi Peter,

good idea. As far as I can tell for now it seems to work fine, but again, no 
extensive testing done.

There is, however, one issue I encountered with Pillow. I think the conda-forge 
build, or a dependency of it, is broken: Pillow must be installed from defaults 
manually and not as a dependency of rdkit. If the conda-forge one is used, there 
is an error about a missing DLL when displaying molecules in Jupyter. So the 
installation order was as below:

conda create --name rdkit_forge python=3.7
conda activate rdkit_forge
conda install pillow
conda install -c conda-forge ipykernel openblas numpy pandas rdkit 
"libblas=*=*openblas"

For scipy, and hence also scikit-learn, one still needs to use pip, as there is 
only an MKL build available even on conda-forge. xgboost will also need to come 
from pip, as there is no Windows build on conda-forge and the one on defaults 
isn't compatible with OpenBLAS. There are probably more such issues, but I guess 
it's still better than getting everything but rdkit from PyPI.

Best Regards,

Thomas



From: Peter St. John
Sent: Tuesday, 12 November 2019 16:25
To: Greg Landrum
Cc: Thomas Strunz; rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] Anaconda installation without hard dependency on Intel MKl (windows)

Another option would be to try the conda-forge rdkit. It doesn't appear to use 
MKL -- I think the MKL dependency for the rdkit::rdkit package is coming from 
the defaults::numpy dependency.

> some tools for example scipy and pandas are only available as openblas builds 
> via pypi (pip).
I believe the conda-forge recipes for these are also based on openblas. You 
might try installing rdkit from conda-forge in a fresh environment and see if 
those numpy / scipy builds work for you.

-- Peter

On Tue, Nov 12, 2019 at 6:48 AM Greg Landrum <greg.land...@gmail.com> wrote:


On Tue, Nov 12, 2019 at 2:00 PM Thomas Strunz <beginn...@hotmail.de> wrote:

So for me this is a temporary workaround but not really a permanent long-term 
solution (and as far as I can tell it is mostly an issue of conda and Windows, not 
of rdkit).

Yeah, it's clearly not the ideal solution to the problem. And, yes, the problem 
is clearly related to conda and Windows. It seems that there used to be a blog 
post explaining this (linked from this github issue: 
https://github.com/ContinuumIO/anaconda-issues/issues/656), but the URL no 
longer works and I can't find it.
If the performance difference is that dramatic on AMD CPUs, it may be worth 
raising a new issue in the repo above and seeing if you get any kind of response.

-greg



___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Anaconda installation without hard dependency on Intel MKl (windows)

2019-11-12 Thread Thomas Strunz
Hi Greg,

great, thanks for these commands. This leads to the yml below (edited to include 
pip), which I came up with by looking at the conda package's meta.yaml file.

name: rdkit_openblas
channels:
  - defaults
dependencies:
  - python=3.7
  - libboost
  - py-boost
  - pillow
  - cairo
  - freetype
  - eigen
  - rdkit
  - ipykernel
  - pip:
- pandas

I suspect just installing from this yml won't work, as some of these libs have 
unlisted dependencies, while for rdkit you want no deps. So below is the exact, 
non-optimized order of commands used:

conda create -n rdkit_openblas python=3.7
conda activate rdkit_openblas
pip install pandas
conda install --no-deps libboost py-boost
conda install cairo freetype pillow
conda install eigen
conda install --no-deps -c rdkit rdkit
conda install ipykernel

(pandas also installs numpy from pypi as dependency)

Further testing is still to be done, but a quick sanity check suggests it works (I 
attached a notebook; not sure how the mailing list deals with that). The notebook 
also checks that numpy really isn't using MKL and checks the speed (a 6-core 
Ryzen 3600 should for sure be below 100 seconds in this test, or else something 
is wrong).

This kind of works, but installing any additional package will run into similar 
issues. So scipy and scikit-learn most likely must also be installed via pip, and 
for any other install or update one must check the install list to make sure MKL 
doesn't sneak back in. And if you want to add tensorflow-gpu on top, I'm not sure 
how that will work. Plus, updating this env is probably also a big problem.

So for me this is a temporary workaround but not really a permanent long-term 
solution (and as far as I can tell it is mostly an issue of conda and Windows, not 
of rdkit).

Best regards,

Thomas


From: Greg Landrum
Sent: Tuesday, 12 November 2019 11:06
To: Thomas Strunz
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] Anaconda installation without hard dependency on Intel MKl (windows)


Sorry I missed that you were on windows.

It looks like you can probably carefully construct an environment manually 
using the '--no-deps' argument to "conda install"

I created and activated a python 3.7 environment on windows, installed pandas 
and numpy from pip, and then did:

conda install --no-deps libboost py-boost
conda install --no-deps -c rdkit rdkit

That got me to the point that I can at least import the rdkit and rdkit.Chem, 
but there are almost certainly some missing dependencies that you'll have to 
install (either from conda if possible or from pip if not).

Note that once you've gone through the dance of making this work, you should be 
able to easily re-create the environment: 
http://rdkit.blogspot.com/2019/10/sharing-conda-environments.html

-greg

On Tue, Nov 12, 2019 at 10:43 AM Thomas Strunz <beginn...@hotmail.de> wrote:
Hi Greg,

thanks for your quick reply.

The main problem is Windows. This doesn't work on Windows (one needs to add -c 
anaconda to your command so that nomkl is found), but then this is the result:

[inline screenshot of the conda install plan, not preserved in the archive]

I.e. it still wants to install MKL together with nomkl.

best regards,

Thomas




From: Greg Landrum <greg.land...@gmail.com>
Sent: Tuesday, 12 November 2019 10:13
To: Thomas Strunz <beginn...@hotmail.de>
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] Anaconda installation without hard dependency on Intel MKl (windows)

Hi Thomas,

I'm not sure how to configure conda so that a pip-installed version of numpy 
and/or pandas is used, but you can use conda versions without MKL by installing 
the nomkl package.

This conda command creates a functioning environment that does not have the MKL 
installed:
conda create -n no_mkl python=3.7 nomkl numpy pandas scikit-learn rdkit::rdkit

(no_mkl) glandrum@otter:~$ python -c 'import rdkit;print(rdkit.__version__)'
2019.09.1

Does that help?
-greg


On Tue, Nov 12, 2019 at 9:53 AM Thomas Strunz <beginn...@hotmail.de> wrote:
Dear all,

would it be possible to make the RDKit package not depend on MKL (probably pulled 
in via numpy/pandas) and have it accept pre-installed numpy and pandas, for 
example from pip, as sufficient?

The background to this is simple: Intel MKL cripples performance on any 
AMD-based processor (4-5 times slower, SSE vs. AVX2). Since these AMD CPUs are now 
actually competitive again, this poses an issue. On Linux it's simple to just 
link against OpenBLAS, and as far as I can tell it works just fine.

On Windows the issue is that some tools, for example scipy and pandas, are only 
available as OpenBLAS builds via PyPI (pip). I'm not sure where the issue is, 
but when I then want to install rdkit into such an environment it doesn't 
"accept" the preinstalled numpy and pandas from 
Re: [Rdkit-discuss] Anaconda installation without hard dependency on Intel MKl (windows)

2019-11-12 Thread Thomas Strunz
Hi Greg,

thanks for your quick reply.

The main problem is Windows. This doesn't work on Windows (one needs to add -c 
anaconda to your command so that nomkl is found), but then this is the result:

[inline screenshot of the conda install plan, not preserved in the archive]

I.e. it still wants to install MKL together with nomkl.

best regards,

Thomas




From: Greg Landrum
Sent: Tuesday, 12 November 2019 10:13
To: Thomas Strunz
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] Anaconda installation without hard dependency on Intel MKl (windows)

Hi Thomas,

I'm not sure how to configure conda so that a pip-installed version of numpy 
and/or pandas is used, but you can use conda versions without MKL by installing 
the nomkl package.

This conda command creates a functioning environment that does not have the MKL 
installed:
conda create -n no_mkl python=3.7 nomkl numpy pandas scikit-learn rdkit::rdkit

(no_mkl) glandrum@otter:~$ python -c 'import rdkit;print(rdkit.__version__)'
2019.09.1

Does that help?
-greg


On Tue, Nov 12, 2019 at 9:53 AM Thomas Strunz <beginn...@hotmail.de> wrote:
Dear all,

would it be possible to make the RDKit package not depend on MKL (probably pulled 
in via numpy/pandas) and have it accept pre-installed numpy and pandas, for 
example from pip, as sufficient?

The background to this is simple: Intel MKL cripples performance on any 
AMD-based processor (4-5 times slower, SSE vs. AVX2). Since these AMD CPUs are now 
actually competitive again, this poses an issue. On Linux it's simple to just 
link against OpenBLAS, and as far as I can tell it works just fine.

On Windows the issue is that some tools, for example scipy and pandas, are only 
available as OpenBLAS builds via PyPI (pip). I'm not sure where the issue is, 
but when I then want to install rdkit into such an environment it doesn't 
"accept" the preinstalled numpy and pandas from pip and wants to install the 
conda version based on MKL again, maybe due to the package dependencies.

Can this be changed so that, when installing rdkit, it accepts numpy and pandas 
installed from PyPI?

thanks for any help.

Best regards,

Thomas



___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Anaconda installation without hard dependency on Intel MKl (windows)

2019-11-12 Thread Thomas Strunz
Dear all,

would it be possible to make the RDKit package not depend on MKL (probably pulled 
in via numpy/pandas) and have it accept pre-installed numpy and pandas, for 
example from pip, as sufficient?

The background to this is simple: Intel MKL cripples performance on any 
AMD-based processor (4-5 times slower, SSE vs. AVX2). Since these AMD CPUs are now 
actually competitive again, this poses an issue. On Linux it's simple to just 
link against OpenBLAS, and as far as I can tell it works just fine.

On Windows the issue is that some tools, for example scipy and pandas, are only 
available as OpenBLAS builds via PyPI (pip). I'm not sure where the issue is, 
but when I then want to install rdkit into such an environment it doesn't 
"accept" the preinstalled numpy and pandas from pip and wants to install the 
conda version based on MKL again, maybe due to the package dependencies.

Can this be changed so that, when installing rdkit, it accepts numpy and pandas 
installed from PyPI?

thanks for any help.

Best regards,

Thomas



___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] New Drawing code: Fixed sized molecules

2018-06-14 Thread Thomas Strunz
I was now able to adapt the GitHub C++ example to Python, with some adaptations 
(enhancements).


The code in the sample on GitHub generates images of different sizes depending 
on the molecule. For my use case the image size should be fixed (otherwise I will 
get layout/design issues), but the molecule should take up only as much space 
as needed (depending on the desired size, e.g. dots per angstrom), and if 
it is too large it should shrink to fit the image and also be centered.


Here is the code (2 functions from my class):


def molToImage(self, mol, image_width=300, image_height=150, dpa=30, kekulize=True):
    mol = rdMolDraw2D.PrepareMolForDrawing(mol, kekulize=kekulize)
    w, h, minV, maxV = self.getScaleForDrawing(mol, image_width, image_height, dpa)
    drawer = rdMolDraw2D.MolDraw2DSVG(image_width, image_height)  # fixed image size
    drawer.SetScale(w, h, minV, maxV)
    # re-center image
    x_offset = int((image_width - w) / 2)
    y_offset = int((image_height - h) / 2)
    drawer.SetOffset(x_offset, y_offset)
    # draw
    drawer.DrawMolecule(mol)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText()
    # It seems that the svg renderer used doesn't quite hit the spec.
    # Here are some fixes to make it work in the notebook, although I think
    # the underlying issue needs to be resolved at the generation step
    return svg.replace('svg:', '')


def getScaleForDrawing(self, mol, image_width, image_height, dpa):
    minV = Point2D()
    maxV = Point2D()
    cnf = mol.GetConformer()

    minV.x = maxV.x = cnf.GetAtomPosition(0).x
    minV.y = maxV.y = cnf.GetAtomPosition(0).y

    for i in range(mol.GetNumAtoms()):
        minV.x = min(minV.x, cnf.GetAtomPosition(i).x)
        minV.y = min(minV.y, cnf.GetAtomPosition(i).y)
        maxV.x = max(maxV.x, cnf.GetAtomPosition(i).x)
        maxV.y = max(maxV.y, cnf.GetAtomPosition(i).y)

    w = int(dpa * (maxV.x - minV.x))
    h = int(dpa * (maxV.y - minV.y))

    # shrink to fit
    if w > image_width or h > image_height:
        rw = w / image_width
        rh = h / image_height
        ratio = max(rw, rh)
        w = int(w / ratio)
        h = int(h / ratio)

    return (w, h, minV, maxV)


Best Regards,

Thomas



From: Thomas Strunz
Sent: Wednesday, 13 June 2018 14:28
To: rdkit-discuss@lists.sourceforge.net
Betreff: [Rdkit-discuss] New Drawing code: Fixed sized molecules


When using the "new" drawing code according to


http://rdkit.blogspot.com/2015/02/new-drawing-code.html


I also want to be able to control the size of the molecule (not of the image), so 
that if I have to depict multiple molecules, for example, smaller ones are not 
drawn in an oversized fashion. Therefore I would want to control the font size and 
bond length, somewhat similar to what is shown here:


https://iwatobipen.wordpress.com/2017/11/03/draw-high-quality-molecular-image-in-rdkit-rdkit/


Also, this would be for a web service, hence drawer.GetDrawingText() is definitely 
preferred over Draw.MolToFile and then reading the temp file back in.


I did find this sample showing this:


https://github.com/rdkit/rdkit/pull/1355/commits/8141baaa7a990e68632ab1b8445671fbcc3ca2f6


But it's in C++, not Python, and not very easy to follow. So how can this be 
done in Python? Is there a convenience function I haven't found yet?

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] New Drawing code: Fixed sized molecules

2018-06-13 Thread Thomas Strunz
When using the "new" drawing code according to


http://rdkit.blogspot.com/2015/02/new-drawing-code.html


I also want to be able to control the size of the molecule (not of the image), so 
that if I have to depict multiple molecules, for example, smaller ones are not 
drawn in an oversized fashion. Therefore I would want to control the font size and 
bond length, somewhat similar to what is shown here:


https://iwatobipen.wordpress.com/2017/11/03/draw-high-quality-molecular-image-in-rdkit-rdkit/


Also, this would be for a web service, hence drawer.GetDrawingText() is definitely 
preferred over Draw.MolToFile and then reading the temp file back in.


I did find this sample showing this:


https://github.com/rdkit/rdkit/pull/1355/commits/8141baaa7a990e68632ab1b8445671fbcc3ca2f6


But it's in C++, not Python, and not very easy to follow. So how can this be 
done in Python? Is there a convenience function I haven't found yet?

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] 3D alignment in Python: align conformers of 2 molecules

2014-07-01 Thread Thomas Strunz
Hi all,

here the code for Open3DALIGN:

cids = generateConformers(mol, numConformers)
prbPyMP = AllChem.MMFFGetMoleculeProperties(mol)

maxScore = 0
for refCid in refCids:
    for cid in cids:
        alignment = AllChem.GetO3A(mol, refMol, prbPyMP, refPyMP,
                                   prbCid=cid, refCid=refCid)
        score = alignment.Score()
        logger.debug('Score: %.2f', score)
        if score > maxScore:
            logger.info('New max. Score: %.2f', score)
            maxScore = score
            refConformerId = refCid
            molCid = cid

alignment = AllChem.GetO3A(mol, refMol, prbPyMP, refPyMP,
                           prbCid=molCid, refCid=refConformerId)
alignment.Align()
# show in PyMol

One question I have is how I can align this to a specific fragment or the MCS. 
I tried using constraintMap, but for some reason O3A does not want to align the 
2 structures on it. I suspect that is because the rest of the alignment would then 
not be very good.

The code:

constraintMap = []
constraintWeights = []
mcs = MCS.FindMCS([refMol, mol], bondCompare='bondtypes', ringMatchesRingOnly=False)
if mcs.completed == 1 and mcs.numAtoms > 0:
    core = Chem.MolFromSmarts(mcs.smarts)  # or use a specific smarts pattern through an argument option
    logger.info('MCS: %s', Chem.MolToSmiles(core))

    refMatch = refMol.GetSubstructMatch(core)
    match = mol.GetSubstructMatch(core)

    for idx, val in enumerate(match):
        constraintMap.append((val, refMatch[idx]))
        constraintWeights.append(100.0)
    logger.info(constraintMap)
...
alignment = AllChem.GetO3A(mol, refMol, prbPyMP, refPyMP, prbCid=cid,
                           refCid=refCid, constraintMap=constraintMap,
                           constraintWeights=constraintWeights)

The alignment is different, but it still does not want to align to my defined 
fragment, even if I set constraintWeights to a very high value. In my case I 
especially want to align heteroatoms properly. For that I tried:

for idx, val in enumerate(match):
    constraintMap.append((val, refMatch[idx]))
    if mol.GetAtomWithIdx(val).GetAtomicNum() != 6:
        constraintWeights.append(1000.0)
        logger.info('Set weight to 1000')
    else:
        constraintWeights.append(10.0)

But the single heteroatoms in my test case are still not aligned to each other. 
I also tried ConstrainedEmbed (or whatever it is called), but that results either 
in an error or, with relaxed parameters, in a completely useless conformation, 
namely on a ring.

What are my options?
From: greg.land...@gmail.com
Date: Fri, 27 Jun 2014 08:20:56 +0200
Subject: Re: [Rdkit-discuss] 3D alignment in Python: align conformers of 2 
molecules
To: beginn...@hotmail.de
CC: rdkit-discuss@lists.sourceforge.net



On Fri, Jun 27, 2014 at 7:57 AM, Thomas Strunz beginn...@hotmail.de wrote:






thanks for your quick reply. This helped to improve the alignment.


I'm glad to hear it! 

How can I reproduce the alignment done with the Open3DAlign node in Python? 
Is it possible at all?

But of course. :-) There's some example code that shows how to do it on page 37 
of Paolo's presentation from the last RDKit UGM:

https://github.com/rdkit/UGM_2013/raw/master/Presentations/Tosco.RDKit_UGM2013.pdf

-greg

 
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] 3D alignment in Python: align conformers of 2 molecules

2014-06-27 Thread Thomas Strunz
Hi Greg,

thanks for your quick reply. This helped to improve the alignment.

How can I reproduce the alignment done with the Open3DAlign node in Python? 
Is it possible at all?

Best Regards,

Thomas

From: greg.land...@gmail.com
Date: Fri, 27 Jun 2014 05:10:42 +0200
Subject: Re: [Rdkit-discuss] 3D alignment in Python: align conformers of 2 
molecules
To: beginn...@hotmail.de
CC: rdkit-discuss@lists.sourceforge.net

Hi Thomas,
I think there are a couple of problems with the code here:
1) you aren't storing the conformer of mol that produces the best alignment 
(cid in the above code)

2) you aren't keeping the best alignment, since you repeatedly run the same 
molecule instances through the alignment code. You could fix this by, after the 
double conformer loop, adding something like:
   AllChem.AlignMol(refMol, mol, prbCid=refConformerId, refCid=molConformerId, atomMap=zip(refMatch, match))
where I have assumed that molConformerId is the variable you use to solve 
problem 1.
The loop  would become something like this:

minRmsd = 1000;  for refCid in refCids:
for cid in cids:

rmsd = AllChem.AlignMol(refMol, mol, prbCid=refCid, refCid=cid, 
atomMap=zip(refMatch,match))if rmsd  minRmsd:

minRmsd = rmsdrefConformerId = refCid   
 molConformerId = cid


rmsd = AllChem.AlignMol(refMol, mol, prbCid=refConformerId, 
refCid=molConformerId,
atomMap=zip(refMatch,match))



It's also not really right to compare the results of this, which uses the MCS 
code to find a common set of atoms and then does an RMSD alignment of those, 
and the Open3DAlign results, which use a fuzzier scheme to identify the 
atom-atom mapping between the molecules.


-greg









On Thu, Jun 26, 2014 at 2:32 PM, Thomas Strunz beginn...@hotmail.de wrote:





I'm trying to align all conformers of 2 molecules (and keep the best ones) 
using the python api by following some of the tutorials:

http://nbviewer.ipython.org/gist/greglandrum/4316435/Working%20in%203D.ipynb



http://nbviewer.ipython.org/github/greglandrum/rdkit_blog/blob/master/notebooks/Using%20ConstrainedEmbed.ipynb



However, whatever I try, the alignment is considerably different from the one 
created using the Open3DAlign node in KNIME.

Currently the issue is that the molecules do not seem to be aligned properly; 
there is some sort of small shift. See the code below.



Best Regards,

Thomas

refCids = generateConformers(refMol, numConformers)
mcs = MCS.FindMCS([refMol, mol], ringMatchesRingOnly=matchesRingOnly)

if mcs.completed == 1 and mcs.numAtoms > 0:
    core = Chem.MolFromSmarts(mcs.smarts)
    logger.info('MCS: %s', Chem.MolToSmiles(core))

    refMatch = refMol.GetSubstructMatch(core)
    match = mol.GetSubstructMatch(core)

    # conformers for current target
    cids = generateConformers(mol, numConformers, coordMap=coordMap)

    minRmsd = 1000
    for refCid in refCids:
        for cid in cids:
            rmsd = AllChem.AlignMol(refMol, mol, prbCid=refCid, refCid=cid,
                                    atomMap=zip(refMatch, match))
            logger.debug('RMSD: %.2f', rmsd)
            if rmsd < minRmsd:
                logger.debug('New min RMSD: %.2f', rmsd)
                minRmsd = rmsd
                refConformerId = refCid


def generateConformers(mol, numConformers):
    AllChem.EmbedMolecule(mol)
    AllChem.MMFFOptimizeMolecule(mol)
    cids = AllChem.EmbedMultipleConfs(mol, numConfs=numConformers,
                                      maxAttempts=50, pruneRmsThresh=0.5, coordMap=coordMap)
    for cid in cids:
        AllChem.MMFFOptimizeMolecule(mol, confId=cid)
    return cids



  

___

Rdkit-discuss mailing list

Rdkit-discuss@lists.sourceforge.net

https://lists.sourceforge.net/lists/listinfo/rdkit-discuss




[Rdkit-discuss] 3D alignment in Python: align conformers of 2 molecules

2014-06-26 Thread Thomas Strunz
I'm trying to align all conformers of 2 molecules (and keep the best ones) 
using the python api by following some of the tutorials:

http://nbviewer.ipython.org/gist/greglandrum/4316435/Working%20in%203D.ipynb

http://nbviewer.ipython.org/github/greglandrum/rdkit_blog/blob/master/notebooks/Using%20ConstrainedEmbed.ipynb

However, whatever I try, the alignment is considerably different from the one 
created using the Open3DAlign node in KNIME.

Currently the issue is that the molecules do not seem to be aligned properly; 
there is some sort of small shift. See the code below.

Best Regards,

Thomas

refCids = generateConformers(refMol, numConformers)
mcs = MCS.FindMCS([refMol, mol], ringMatchesRingOnly=matchesRingOnly)

if mcs.completed == 1 and mcs.numAtoms > 0:
    core = Chem.MolFromSmarts(mcs.smarts)
    logger.info('MCS: %s', Chem.MolToSmiles(core))

    refMatch = refMol.GetSubstructMatch(core)
    match = mol.GetSubstructMatch(core)

    # conformers for current target
    cids = generateConformers(mol, numConformers, coordMap=coordMap)

    minRmsd = 1000
    for refCid in refCids:
        for cid in cids:
            rmsd = AllChem.AlignMol(refMol, mol, prbCid=refCid, refCid=cid,
                                    atomMap=zip(refMatch, match))
            logger.debug('RMSD: %.2f', rmsd)
            if rmsd < minRmsd:
                logger.debug('New min RMSD: %.2f', rmsd)
                minRmsd = rmsd
                refConformerId = refCid


def generateConformers(mol, numConformers):
    AllChem.EmbedMolecule(mol)
    AllChem.MMFFOptimizeMolecule(mol)
    cids = AllChem.EmbedMultipleConfs(mol, numConfs=numConformers,
                                      maxAttempts=50, pruneRmsThresh=0.5, coordMap=coordMap)
    for cid in cids:
        AllChem.MMFFOptimizeMolecule(mol, confId=cid)
    return cids



___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss