Re: [Rdkit-discuss] Memory management during conformer generation

2015-06-27 Thread Dmitri Maziuk
On 6/26/2015 9:48 AM, az wrote:
 Thanks Jean-Paul

 You're right that I eat up a lot of memory with large files but I think
 it's not the whole story. If it were, my memory should come back each
 time a new file is being read (jobs=[]), no?

No. It's a feature of garbage collection: your memory may come back 
anytime between then and program exit.

Dima


--
Monitor 25 network devices or servers for free with OpManager!
OpManager is web-based network management software that monitors
network devices and physical & virtual servers, alerts via email & sms
for fault. Monitor 25 devices for free with no restriction. Download now
http://ad.doubleclick.net/ddm/clk/292181274;119417398;o
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Memory management during conformer generation

2015-06-27 Thread David Hall
 On Jun 27, 2015, at 6:05 AM, Dmitri Maziuk dmaz...@bmrb.wisc.edu wrote:
 
 On 6/26/2015 9:48 AM, az wrote:
 Thanks Jean-Paul
 
 You're right that I eat up a lot of memory with large files but I think
 it's not the whole story. If it were, my memory should come back each
 time a new file is being read (jobs=[]), no?
 
 No. It's a feature of garbage collection: your memory may come back 
 anytime between then and program exit.

One could always trigger garbage collection manually.

https://docs.python.org/2/library/gc.html#gc.collect

And use gc.get_count() to make sure the count of objects went down with the 
collection.
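
A minimal sketch of that check, under a few assumptions: Node and make_cycle are made-up names for illustration, and the cycles are built on purpose so that there is something for the collector to find. Note that gc.get_count() returns the collector's per-generation counters as a tuple rather than a total object count, so the sketch counts tracked objects with len(gc.get_objects()) instead. This only tells you about Python-level objects; memory held by a C++ extension will not show up here.

```python
import gc


class Node(object):
    """Tiny object that can point at another Node (illustration only)."""

    def __init__(self):
        self.other = None


def make_cycle():
    # Two objects that reference each other: reference counting alone
    # cannot free them once the local names go away.
    a, b = Node(), Node()
    a.other, b.other = b, a


gc.disable()                  # keep the automatic collector out of the way
gc.collect()                  # start from a clean slate

for _ in range(1000):
    make_cycle()              # leaves 1000 unreachable cycles behind

with_garbage = len(gc.get_objects())
unreachable = gc.collect()    # returns the number of unreachable objects found
after = len(gc.get_objects())
gc.enable()

print("unreachable objects found: %d" % unreachable)
print("tracked objects before/after collect: %d / %d" % (with_garbage, after))
```

If the number of tracked objects does not drop after gc.collect(), the objects are most likely still referenced from somewhere rather than waiting to be collected.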

-David


Re: [Rdkit-discuss] Memory management during conformer generation

2015-06-27 Thread Greg Landrum
I apologize that I haven't had a chance to look at this in detail yet, but
I can at least give a quick answer to the below:
Python uses a deterministic scheme for doing garbage collection based on
reference counting, so memory should be freed as soon as you do jobs=[].
That's assuming that the futures code (which I don't know) isn't doing
anything odd behind the scenes to hold onto references.
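
A small sketch of the reference-counting behaviour described above, assuming CPython. The Job class is a made-up stand-in for whatever the list holds, not the futures API. A weak reference is used as a probe because it does not keep the object alive.

```python
import gc
import weakref


class Job(object):
    """Stand-in for an object holding a large result (hypothetical)."""

    def __init__(self, payload):
        self.payload = payload


gc.disable()  # show that the cyclic collector is not needed for this case

jobs = [Job(bytearray(1024)) for _ in range(3)]
probe = weakref.ref(jobs[0])       # weak references do not keep the object alive

alive_before = probe() is not None  # True: the list still holds the Job
jobs = []                           # drops the only strong reference to the old list
alive_after = probe() is not None   # False in CPython: freed immediately

gc.enable()
```

This holds only when the list owns the last reference. If anything else still points at the objects, rebinding jobs frees nothing.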

-greg

On Friday, June 26, 2015, az adam.zalew...@mail.com wrote:

  Thanks Jean-Paul

 You're right that I eat up a lot of memory with large files but I think
 it's not the whole story. If it were, my memory should come back each time a
 new file is being read (jobs=[]), no? Instead I hit my limit after 8-10
 very similar input files, even though the usage after 2-3 is around 1/3 of
 my RAM.

 Cheers,
 Adam

 On 24-Jun-15 17:38, JP wrote:

 Isn't the problem here that you are keeping an array (jobs) and you keep
 adding molecules to it, never letting the garbage collector collect/clear
 any memory? If your file has a million molecules, you will have an array
 of a million molecules in memory...

  Why don't you process each single molecule (set name / remove similar
 confs etc. / remove high energy stuff), write it to file and release it, in
 the "if mol:" clause?

  Cheers
 JP

  -
 Jean-Paul Ebejer
 Early Stage Researcher

 On 24 June 2015 at 16:47, az adam.zalew...@mail.com wrote:

  Hi

  Using the cookbook code as basis (apologies if I should have posted in
 the corresponding topic), I've put together a script to generate conformers
 for my smiles library. Works like a charm too, aside from the fact that
 after 10-20 hours, I'm out of RAM and swap (the memory consumption seems to
 be accumulating with each iteration). I'd appreciate any hints for getting
 this resolved (any other ones as well).

 Thanks a lot,
 Adam

 the code

max_workers = 16

def generateconformations(m, n, name=''):
    m = Chem.AddHs(m)
    ids = AllChem.EmbedMultipleConfs(m, numConfs=n, pruneRmsThresh=0.5,
                                     randomSeed=1)
    etable = []  ## Gathers conformer energies

    for id in ids:
        ff = AllChem.UFFGetMoleculeForceField(m, confId=id)
        ff.Minimize()
        etable.append(ff.CalcEnergy())

    return PropertyMol(m), list(ids), etable, name

input_dir, output_dir = sys.argv[1:3]
n = 75  ## Conformer number

os.chdir(input_dir)
for ifile in glob.glob('*.s*'):

    raw_file = open(ifile, 'r')  ## To get back molecule name later on
    ofile = os.path.join(output_dir, 'conf_' + ifile)

    if 'smiles' in ifile:
        suppl = Chem.SmilesMolSupplier(ifile, titleLine=False,
                                       delimiter='\t')
        ofile = ofile.replace('.smiles', '.sdf')
        sdfinput = False

    if not os.path.isfile(ofile):

        writer = Chem.SDWriter(ofile)

        print 'Processing %s' % os.path.abspath(ifile), datetime.datetime.now()

        if sdfinput == False:
            with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
                # Submit a set of asynchronous jobs
                jobs = []

                for mol in suppl:
                    if mol:
                        ## extracting molecule name from the original ifile
                        raw_line = raw_file.readline().split()[1]
                        ## returns molecules and associated ids / until here
                        ## the conformers cannot be pickled
                        job = executor.submit(generateconformations, mol,
                                              n, raw_line)
                        jobs.append(job)

                for job in jobs:
                    mol, ids, etable, name = job.result()
                    mol.SetProp("_Name", name)  ## Restoring lost property
                    mine = min(etable)  ## Lowest conformer energy

                    for i in ids:
                        ## Conformers with energies greater than min+20
                        ## will not be written
                        if etable[i] > mine + 20:
                            ids.remove(i)
                    for i in ids:
                        for j in ids:
                            if i != j:
                                ## 0.5 A threshold for keeping conformers
                                if AllChem.GetConformerRMS(mol, i, j) < 0.5:
                                    ids.remove(j)
                    for id in ids:
                        writer.write(mol, confId=id)

        writer.close()

    else:
        print "%s exists, skipping" % ofile

 ===






Re: [Rdkit-discuss] Memory management during conformer generation

2015-06-27 Thread Dmitri Maziuk
On 6/27/2015 5:45 AM, Greg Landrum wrote:
...
 That's assuming that the futures code (which I don't know) isn't doing
 anything odd behind the scenes to hold onto references.

Or every mol in supplier holds a pointer to c++ dll that python vm 
doesn't quite know how to garbage-collect, which keeps a 
still-referenced object inside the job, which means jobs=[] creates a 
new array without deleting the old jobs. Who knows.
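
To illustrate that last point with plain Python objects (no RDKit, no futures): rebinding a name only drops one reference. In the sketch below hidden_registry is a hypothetical stand-in for "something behind the scenes" that also holds on to each job, and Job is a made-up class. Nothing here claims that the futures code or the supplier actually does this.

```python
import weakref


class Job(object):
    """Stand-in for a submitted job (hypothetical)."""


# Hypothetical registry playing the role of code elsewhere that also
# keeps a reference to every job.
hidden_registry = []

jobs = []
for _ in range(3):
    job = Job()
    jobs.append(job)
    hidden_registry.append(job)

probe = weakref.ref(jobs[0])        # weak reference used only as a probe
del job                             # the loop variable is a reference too
jobs = []                           # rebinds the name; the old list is freed ...
still_alive = probe() is not None   # ... but the jobs live on in the registry

del hidden_registry[:]              # release the hidden references as well
freed = probe() is None             # now the job can actually go away
```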

The first answer was the right one: process one molecule at a time. Even 
better, split the input file into one-per-molecule, then use ec2 or 
condor or osg to run your one-at-a-time script on all of them at once.
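
A rough sketch of the one-at-a-time idea that still uses a pool: keep only a bounded number of jobs in flight and hand each result to a writer as soon as it is ready, instead of collecting every future in a list. The names process_stream, square, written and max_pending are made up for illustration. ThreadPoolExecutor is used only to keep the example self-contained; in the script from this thread one would presumably pass the ProcessPoolExecutor and something like writer.write instead. Results arrive in completion order, not input order.

```python
from concurrent import futures


def process_stream(executor, func, items, handle, max_pending=4):
    """Run func over items, keeping at most max_pending jobs in flight.

    Each result is passed to handle() as soon as it is ready and then
    dropped, so finished results never pile up in a list.
    """
    pending = set()
    done_count = 0
    for item in items:
        pending.add(executor.submit(func, item))
        if len(pending) >= max_pending:
            # Block until at least one job finishes, then drain the finished ones.
            done, pending = futures.wait(
                pending, return_when=futures.FIRST_COMPLETED)
            for job in done:
                handle(job.result())
                done_count += 1
    # Drain whatever is still running after the input is exhausted.
    for job in futures.as_completed(pending):
        handle(job.result())
        done_count += 1
    return done_count


def square(x):
    # Placeholder for the expensive per-molecule work.
    return x * x


written = []  # stands in for writing each result to the output file
with futures.ThreadPoolExecutor(max_workers=2) as executor:
    total = process_stream(executor, square, range(10), written.append)
```

Because the input is consumed lazily, at most max_pending molecules and results are alive at any time, whatever the size of the input file.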

Dima

