Re: [Rdkit-discuss] Using RdKit in Parallel

2019-02-20 Thread Michal Krompiec
Dear Stamatia,
If the molecules are processed completely independently by your code, it
may be simpler to split the SDF into chunks (e.g. with csplit in a bash
script), run separate instances of your Python code on each chunk, wait
until all are finished, and finally collate the output. That way you avoid
the problem altogether.
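
For illustration only, a minimal Python sketch of such a splitter (the
split_sdf name, the big.sdf input file and the chunk count are made up for
the example); it relies on the fact that each SDF record ends with a line
containing only $$$$, and it reads the whole file into memory, which is fine
for moderately sized files:

def split_sdf(path, n_chunks, prefix="chunk"):
    # read the whole file and split it on the '$$$$' record terminator
    with open(path) as fh:
        text = fh.read()
    records = [r + "$$$$\n" for r in text.split("$$$$\n") if r.strip()]
    # ceiling division so every record ends up in some chunk
    per_chunk = max(1, (len(records) + n_chunks - 1) // n_chunks)
    for idx in range(0, len(records), per_chunk):
        out_name = "%s_%03d.sdf" % (prefix, idx // per_chunk)
        with open(out_name, "w") as out:
            out.writelines(records[idx:idx + per_chunk])

split_sdf("big.sdf", 8)  # hypothetical input file and chunk count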
Best,
Michal



Re: [Rdkit-discuss] Using RdKit in Parallel

2019-02-20 Thread Christos Kannas
Hi Stamatia,

Yes, SDMolSupplier is not thread safe.
My guess is that this is down to the nature of the SDF format: each molecule
record spans multiple lines and you do not know a priori how many lines a
given molecule takes, so the file cannot simply be split across different
threads/processes.

Given that, the approach you proposed is the preferred one: process each SDF
file in a separate process and return the matched molecules.

I would advise using the concurrent.futures package (
https://docs.python.org/3/library/concurrent.futures.html) instead of
multiprocessing directly, as it provides an abstraction layer on top of
multiprocessing. See the documentation example for ProcessPoolExecutor.
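
As a rough sketch of what that could look like for your case (untested;
search_file and run_search are made-up names, and max_workers is an
arbitrary choice), with one worker process per SDF file:

import concurrent.futures
from rdkit import Chem

patt = Chem.MolFromSmarts('NC(NNC=O)=O')

def search_file(path):
    # return the molecules in one SDF file that match the SMARTS pattern
    matches = []
    for mol in Chem.SDMolSupplier(path):
        if mol is None:
            continue
        if mol.HasSubstructMatch(patt):
            matches.append(mol)
    return matches

def run_search(sdf_paths, max_workers=4):
    all_matches = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as pool:
        # each SDF file is handled by its own worker process
        for file_matches in pool.map(search_file, sdf_paths):
            all_matches.extend(file_matches)
    return all_matches

On Windows, call run_search() from inside an if __name__ == '__main__': block
so the worker processes can re-import the script safely.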

One important thing to remember when returning the list of matched
molecules: make sure you preserve the molecule objects (
https://rdkit.org/docs/GettingStartedInPython.html#preserving-molecules), as
transferring data between processes in Python requires that the data be
picklable.
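
In particular, if the SD data fields on the molecules matter to you, you will
probably want to change the default pickle behaviour as described in that
section of the docs; a one-line sketch (assuming a reasonably recent RDKit,
and to be called in every process before molecules are created):

from rdkit import Chem
# keep molecule properties (e.g. SD data fields) when molecules are pickled
# and sent between processes; only a minimal set is preserved by default
Chem.SetDefaultPickleProperties(Chem.PropertyPickleOptions.AllProps)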

Best,

Christos

Christos Kannas

Cheminformatics Researcher & Software Developer






[Rdkit-discuss] Using RdKit in Parallel

2019-02-20 Thread Stamatia Zavitsanou
Hello everyone,


We have been writing a script that searches through a large number of molecules 
within different files for a common substructure. To speed this up we have been 
attempting to run this script in parallel (see scripts below). However, the online 
tutorial notes make reference to problems with using the SDMolSupplier in 
parallel; we were wondering what the issue is and how we could circumvent it 
to speed up some of our calculations.


Non-parallel


from __future__ import print_function
import os
from rdkit import Chem
from progressbar import ProgressBar

pbar = ProgressBar()
matches = []
directory = r'Q:\Data2'   # raw string so the backslash is not treated as an escape
patt = Chem.MolFromSmarts('NC(NNC=O)=O')

for file in pbar(os.listdir(directory)):
    filename = os.fsdecode(file)
    if filename.endswith(".sdf"):
        f = os.path.join(directory, filename)
        suppl = Chem.SDMolSupplier(f)
        for mol in suppl:
            if mol is None:
                continue
            if mol.HasSubstructMatch(patt):
                matches.append(mol)
        print(filename)

# write all matches once, after every file has been processed
w = Chem.SDWriter(r'C:\Users\tom.watts\Desktop\datasmarts4c.sdf')
for m in matches:
    w.write(m)
w.close()



Parallel


import multiprocessing
from joblib import Parallel, delayed   # Parallel/delayed as used below come from joblib

pbar = ProgressBar()
directory = r'E:\Data'
patt = Chem.MolFromSmarts('NC(NNC=O)=O')

# collect the paths of all SDF files in the directory
l = []
for file in pbar(os.listdir(directory)):
    filename = os.fsdecode(file)
    if filename.endswith(".sdf"):
        l.append(os.path.join(directory, filename))

num_cores = multiprocessing.cpu_count()
print(num_cores)

def Search(i):
    # each call handles one SDF file and returns its own list of matches
    matches = []
    suppl = Chem.SDMolSupplier(i)
    for mol in suppl:
        if mol is None:
            continue
        if mol.HasSubstructMatch(patt):
            matches.append(mol)
    return matches

results = Parallel(n_jobs=20)(delayed(Search)(i) for i in l)

# flatten the per-file lists and write them out in the parent process
w = Chem.SDWriter(r'C:\Users\tom.watts\Desktop\SearchDataNonly.sdf')
for file_matches in results:
    for m in file_matches:
        w.write(m)
w.close()



We also wish to use a second script that opens one SDF file and then runs a 
loop over each molecule in the file. This is currently done serially, and we 
were wondering if it could be made parallel.



from rdkit.Chem import AllChem

patt = Chem.MolFromSmarts('NC(N)=O')   # built once, outside the loop

suppl = Chem.SDMolSupplier('Red3.sdf')
for mol in suppl:
    num = mol.GetSubstructMatches(patt)
    logger.debug(Chem.MolToSmiles(mol))   # logger assumed to be configured elsewhere
    h = len(num)                          # number of substructure matches
    m3 = Chem.AddHs(mol)
    cids = AllChem.EmbedMultipleConfs(m3, numConfs)   # numConfs assumed to be defined elsewhere
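
One way this loop could be made parallel (just a sketch under assumptions,
not something from the RDKit docs: embed_one and embed_file are made-up
names, and the default of 10 conformers is arbitrary) is to read the SDF
serially in the parent process, hand each molecule to a worker as a mol
block string, and do the expensive embedding in a process pool:

import concurrent.futures
from rdkit import Chem
from rdkit.Chem import AllChem

def embed_one(molblock, numConfs):
    # rebuild the molecule inside the worker process
    mol = Chem.MolFromMolBlock(molblock)
    if mol is None:
        return None
    m3 = Chem.AddHs(mol)
    AllChem.EmbedMultipleConfs(m3, numConfs)
    return m3   # RDKit molecules (conformers included) are picklable

def embed_file(path, numConfs=10):
    molblocks = [Chem.MolToMolBlock(m) for m in Chem.SDMolSupplier(path) if m is not None]
    with concurrent.futures.ProcessPoolExecutor() as pool:
        return list(pool.map(embed_one, molblocks, [numConfs] * len(molblocks)))

Since RDKit molecules pickle cleanly, the molecules themselves could also be
passed to the pool directly; the mol-block round trip is just one simple option.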



Any comments would be appreciated.


Thanks a lot,

Stamatia Zavitsanou
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss