
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-10 Thread Dimitri Maziuk


On 2017-06-10 07:42, Chris Swain wrote:
> This sounds like the situation where a database might be a better
> option, tuned to store fingerprints in RAM?


The issue is how much programming time it will take, how much that time
is worth, and how many times the solution will be reused. A clever
coding solution could be preferable for other reasons, such as a
programming exercise. If it's a one-off and you just need it done so you
can move on, throwing more hardware at it is often the most
cost-effective solution.


Dima



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-10 Thread Chris Swain

This sounds like the situation where a database might be a better option, tuned 
to store fingerprints in RAM?

Chris


Dr Chris Swain BA MA (Cantab) PhD CChem FRSC
Macs in Chemistry
sw...@mac.com
http://www.macinchem.org




Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Dimitri Maziuk


On 2017-06-09 08:12, Alexis Parenty wrote:

> Dear Greg and Brian,
> Many thanks for your responses. I was also thinking of your streaming
> approach! I think the RAM of most machines could handle lists of 100K
> mols, so we could set the threshold higher than 1000. Actually, I was
> thinking of monitoring the available RAM and only processing the
> matrix and clearing the list when less than 20% of RAM is left. This
> way, the best machines could skip the clearing step and save time.
> What do you think?


Take $100, buy a 200GB SSD, set it up as the swap space, don't worry 
about the RAM.


Dima




Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Alexis Parenty

Yes Greg, this is what I am doing. You're right, I had not thought of
building a list of mols from the shorter list and processing each of
them against the mols of the longer list (which I would build on the
fly from the SMILES). However, I wanted to store the longer list of
structures so that I can access it again later for new substructure searches
from a single structure at a time. It seemed silly to have to rebuild mol
objects from a 500K list of SMILES every time I need to run a new
substructure search. But your approach is going to help me a lot with the
batch-mode search I wanted to do.

Best,

Alexis


Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Greg Landrum

Hi Alexis,

If I understand your use case correctly, you really don't need this level
of complication.

If you are comparing Q molecules to M molecules and M>>Q (in the discussion
so far Q = 1000, M = 500000) and you only need to compare each of the Qs to
each of the Ms a single time, you can safely construct all the Q molecules,
store them in memory, and then loop over the Ms individually, comparing
each one to all of the Qs (this is what I did in my little sample). This will
have more or less exactly the same performance as reading all of the Ms at
once and then processing them.

So, on a machine with infinite memory, these two snippets will take more or
less the same amount of time to execute:

low memory usage:

from rdkit import Chem

# Only the small query set is held in memory.
queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
matches = []
# The large set is streamed one molecule at a time.
for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
    if m is None:
        continue
    matches.append([m.HasSubstructMatch(q) for q in queries])



high memory usage:

from rdkit import Chem

queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
# The whole large set is built in memory before any matching starts.
mols = [x for x in Chem.ForwardSDMolSupplier('./znp.50k.sdf') if x is not None]
matches = []
for m in mols:  # None entries were already filtered out above
    matches.append([m.HasSubstructMatch(q) for q in queries])



The second form consumes a lot more memory without delivering any
improvement in performance.

Best,
-greg
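
To illustrate the memory point, here is a minimal sketch that uses only the Python standard library. It is not RDKit code: `make_records`, `count_streaming`, `count_materialized` and `peak_bytes` are hypothetical helpers, and plain `bytearray` objects stand in for molecules. It uses `tracemalloc` to compare the peak allocation of a streamed loop with that of a loop over a fully built list.

```python
import tracemalloc


def make_records(n, size=1000):
    """Stand-in for a molecule supplier: lazily yields n bulky objects."""
    for _ in range(n):
        yield bytearray(size)


def count_streaming(n):
    """Process records one at a time; each can be freed right after use."""
    return sum(1 for rec in make_records(n) if len(rec) > 0)


def count_materialized(n):
    """Build the full list first, then process it."""
    records = list(make_records(n))
    return sum(1 for rec in records if len(rec) > 0)


def peak_bytes(func, *args):
    """Return (result, peak traced allocation in bytes) for func(*args)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


if __name__ == "__main__":
    for func in (count_streaming, count_materialized):
        result, peak = peak_bytes(func, 50_000)
        print(f"{func.__name__}: {result} records, peak {peak / 1e6:.1f} MB")
```

Both functions do the same work per record, so the difference shows up in peak memory rather than in the result. Real molecule objects are much larger than these stand-ins, so the gap would likely be wider in practice.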



Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Brian Kelley

 What exactly are you doing?

Is this 1000x500k substructure queries or something different?


Brian Kelley


Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Alexis Parenty

Dear Greg and Brian,
Many thanks for your responses. I was also thinking of your streaming
approach! I think the RAM of most machines could handle lists of 100K mols,
so we could set the threshold higher than 1000. Actually, I was thinking of
monitoring the available RAM and only processing the matrix and clearing
the list when less than 20% of RAM is left. This way, the best machines
could skip the clearing step and save time. What do you think?


Best,

Alexis
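
The batch idea above could be sketched roughly as follows. This is a simplification, not the scheme described in the message: it uses a fixed batch size instead of monitoring free RAM, which would need platform-specific or third-party tooling and is not shown. The names `batches` and `match_in_batches` are hypothetical, and `is_match` is a caller-supplied function so the sketch runs without RDKit.

```python
from itertools import islice


def batches(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def match_in_batches(targets, queries, is_match, batch_size=100_000):
    """Compare every target against every query, one batch at a time."""
    rows = []
    for chunk in batches(targets, batch_size):
        for target in chunk:
            rows.append([is_match(target, q) for q in queries])
        # `chunk` is rebound on the next iteration, so the previous
        # batch can be garbage-collected before the next one is built.
    return rows
```

With RDKit, `targets` could be a generator of mol objects and `is_match` something like `lambda m, q: m.HasSubstructMatch(q)`. Only one batch of targets is alive at a time, so the batch size, rather than the size of the full set, bounds the memory used for molecules.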






Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Brian Kelley

While not multithreaded (yet) this is the use case of the filter catalog:

http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1

Look for the SmartsMatcher class in the blog.

It is a good idea to make this multithreaded as well; I'll add this as a
possible enhancement.


Brian Kelley


Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Greg Landrum

Hi Alexis,

I would approach this by loading the 1000 queries into a list of molecules
and then "stream" the others past that (so that you never attempt to load
the full 500K set at once).

Here's a quick sketch of one way to do this:

In [4]: queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]

In [5]: matches = []

In [6]: for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
   ...:     if m is None:
   ...:         continue
   ...:     matches.append([m.HasSubstructMatch(q) for q in queries])
   ...:



Brian has some thoughts on making this particular use case easier/faster
(in particular by adding multi-threading support), so maybe there will be
something in the next release there.

I hope this helps,
-greg
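
Since the original question starts from SMILES rather than SD files, the same streaming pattern might look like the sketch below. The file layout (one SMILES per line, as the first whitespace-separated field) is an assumption, and `stream_matches` and `rdkit_stream_matches` are hypothetical names. The generic part takes the parser and matcher as arguments so it can be tried without RDKit; the RDKit wiring uses `Chem.MolFromSmiles` and `HasSubstructMatch`.

```python
def stream_matches(lines, queries, parse, is_match):
    """Lazily yield (smiles, [match flags]) for each parsable input line."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        smiles = fields[0]
        mol = parse(smiles)
        if mol is None:
            continue  # skip entries the parser rejects
        yield smiles, [is_match(mol, q) for q in queries]


def rdkit_stream_matches(smiles_path, query_smiles):
    """Stream a SMILES file past a small in-memory list of query mols."""
    # Imported lazily so the generic function above runs without RDKit.
    from rdkit import Chem

    queries = [q for q in map(Chem.MolFromSmiles, query_smiles) if q is not None]
    with open(smiles_path) as handle:
        yield from stream_matches(
            handle,
            queries,
            Chem.MolFromSmiles,
            lambda mol, query: mol.HasSubstructMatch(query),
        )
```

Because both functions are generators, each target molecule is built, matched, and dropped before the next line is read, so only the query list stays in memory.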



[Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-04 Thread Alexis Parenty

Dear RDKit community,

I need to screen for substructure relationships between two sets of
structures (1 000 x 500 000). I thought I should build two lists of mol
objects from SMILES, but I keep getting a memory error when the second list
reaches 300 000 mols. All my RAM (12 GB) gets consumed, along with all my
virtual memory.

Do I really have to compromise on speed and build the mol objects on the
fly from two lists of SMILES? Is there another, more memory-efficient way to
store mol objects?

Best,

Alexis
