Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Dimitri Maziuk

On 2017-06-09 08:12, Alexis Parenty wrote:

> Dear Greg and Brian,
> Many thanks for your response. I was also thinking of your streaming
> approach! I think the RAM of most machines could handle lists of 100K
> mols, so we could set the threshold higher than 1000. Actually, I was
> thinking of monitoring the available RAM and only processing the matrix
> and clearing the list once less than 20% of RAM is left. This way, the
> best machines could skip the clearing step and save time.
> What do you think?


Take $100, buy a 200GB SSD, set it up as the swap space, don't worry 
about the RAM.


Dima



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Alexis Parenty
Yes Greg, this is what I am doing. You’re right, I had not thought of
building a list of mols from the shorter list and processing each of them
against the mols of the longer list (which I would build on the fly from
the SMILES). However, I wanted to keep the longer list of structures in
memory so that I could access it again later for new substructure
searches, one query structure at a time. It seemed silly to have to rebuild
mol objects from a list of 500K SMILES every time I need to run a new
substructure search. But your approach is going to help me a lot with the
batch-mode search I wanted to do.

Best,

Alexis
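
One possible way to keep the long list around for later single-query searches without holding 500K live mol objects is to store each molecule in RDKit's binary form and rebuild it only when it is needed. The following is a minimal sketch under that assumption; the helper names `to_binary_store` and `search` and the example SMILES are hypothetical, and the memory saving has not been measured on the data set discussed in this thread.

```python
from rdkit import Chem


def to_binary_store(smiles_iter):
    """Parse each SMILES once and keep the molecule as a byte string."""
    store = []
    for smi in smiles_iter:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # skip SMILES that RDKit cannot parse
            continue
        store.append(mol.ToBinary())
    return store


def search(store, query):
    """Return indices of stored molecules containing the query substructure."""
    hits = []
    for i, blob in enumerate(store):
        mol = Chem.Mol(blob)  # rebuild the molecule from its binary form
        if mol.HasSubstructMatch(query):
            hits.append(i)
    return hits


# Hypothetical example data: three valid SMILES and one invalid string.
store = to_binary_store(["c1ccccc1O", "CCO", "not_a_smiles", "c1ccccc1N"])
hits = search(store, Chem.MolFromSmarts("c1ccccc1"))
```

Rebuilding a molecule from its binary form skips SMILES parsing and sanitization, so it is usually cheaper than calling `Chem.MolFromSmiles` again, at the cost of one reconstruction per molecule per search.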


Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Greg Landrum
Hi Alexis,

If I understand your use case correctly, you really don't need this level
of complication.

If you are comparing Q molecules to M molecules and M>>Q (in the discussion
so far Q = 1000, M = 500K) and you only need to compare each of the Qs to
each of the Ms a single time, you can safely construct all the Q molecules,
store them in memory, and then loop over the Ms individually, comparing
each one to all of the Qs (this is what I did in my little sample). This
will have more or less exactly the same performance as reading all of the
Ms at once and then processing them.

So, on a machine with infinite memory these two snippets will take more or
less the same amount of time to execute:

low memory usage:

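from rdkit import Chem

# Keep only the (small) query set in memory.
queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf')
           if x is not None]
matches = []
# Stream the large set one molecule at a time.
for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
    if m is None:
        continue
    matches.append([m.HasSubstructMatch(q) for q in queries])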



high memory usage:

from rdkit import Chem

queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf')
           if x is not None]
# Materialize the whole large set before doing any matching.
mols = [x for x in Chem.ForwardSDMolSupplier('./znp.50k.sdf')
        if x is not None]
matches = []
for m in mols:
    # None entries were already filtered out when building mols.
    matches.append([m.HasSubstructMatch(q) for q in queries])



The second form consumes a lot more memory without delivering any
improvement in performance.

Best,
-greg
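
Since the original question starts from SMILES rather than SD files, the low-memory pattern above could be adapted roughly as follows. This is only a sketch: the function names `stream_mols` and `match_matrix`, the one-SMILES-per-line layout (SMILES first, optional name after whitespace), and the example molecules are assumptions, not something taken from the thread.

```python
from rdkit import Chem


def stream_mols(smiles_lines):
    """Yield one molecule at a time from an iterable of SMILES lines."""
    for line in smiles_lines:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        mol = Chem.MolFromSmiles(fields[0])
        if mol is not None:  # skip unparsable SMILES
            yield mol


def match_matrix(query_smiles, target_lines):
    """Build all queries up front, then stream the targets past them."""
    queries = [m for m in (Chem.MolFromSmiles(s) for s in query_smiles)
               if m is not None]
    return [[m.HasSubstructMatch(q) for q in queries]
            for m in stream_mols(target_lines)]


# Hypothetical example; in practice target_lines could be an open file handle.
rows = match_matrix(
    ["c1ccccc1", "C(=O)O"],
    ["c1ccccc1C(=O)O benzoic_acid", "CCO ethanol", "", "CC(=O)O acetic_acid"],
)
```

Because `stream_mols` is a generator, only one target molecule exists at a time; the result matrix is the only structure that grows with the size of the large set.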


On Fri, Jun 9, 2017 at 3:33 PM, Alexis Parenty <
alexis.parenty.h...@gmail.com> wrote:

> Hi again, FYI the memory monitoring output is attached. Thanks,
>
> Alexis

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Brian Kelley
What exactly are you doing?

Is this 1000x500k substructure queries or something different?


Brian Kelley



Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Alexis Parenty
Dear Greg and Brian,
Many thanks for your response. I was also thinking of your streaming
approach! I think the RAM of most machines could handle lists of 100K mols,
so we could set the threshold higher than 1000. Actually, I was thinking of
monitoring the available RAM and only processing the matrix and clearing
the list once less than 20% of RAM is left. This way, the best machines
could skip the clearing step and save time. What do you think?


Best,

Alexis
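
The "fill a list, process it, clear it" idea can be sketched with a fixed batch size instead of a RAM check. This is a simplified stand-in for the proposal above, not an implementation of it: the names `batches` and `process_in_batches`, the batch size, and the toy `process` callback are all hypothetical, and a real free-memory test would need a platform-specific call or a third-party package.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_in_batches(items, size, process):
    """Apply `process` to each batch and collect the per-item results."""
    results = []
    for batch in batches(items, size):
        results.extend(process(batch))
        # The batch list is dropped on the next iteration, so at most
        # `size` items are held in memory at any time.
    return results


# Toy example: square ten numbers, four at a time.
out = process_in_batches(range(10), 4, lambda b: [x * x for x in b])
sizes = [len(b) for b in batches(range(10), 4)]
```

With this shape, raising the threshold from 1000 to 100K is a one-argument change, and the batch size could later be derived from a memory measurement if that turns out to be worthwhile.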







Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Brian Kelley
While it is not multithreaded (yet), this is the use case for the filter
catalog:

http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1

Look for the SmartsMatcher class in the blog.

It is a good idea to make this multithreaded as well; I'll add this as a
possible enhancement.


Brian Kelley
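
As a rough illustration of the filter catalog approach, the queries can be registered once as `SmartsMatcher` entries and each molecule then checked against the whole catalog in a single call. This is a sketch under assumptions: the two SMARTS patterns, the entry names, and the helper `matching_queries` are made up for the example, and the calls follow the RDKit Python API as I understand it rather than anything shown in this thread.

```python
from rdkit import Chem
from rdkit.Chem import FilterCatalog

# Hypothetical query set: name -> SMARTS pattern.
queries = {"benzene": "c1ccccc1", "carboxylic acid": "C(=O)[OH]"}

catalog = FilterCatalog.FilterCatalog()
for name, smarts in queries.items():
    # Require at least one occurrence of the pattern.
    matcher = FilterCatalog.SmartsMatcher(name, smarts, 1)
    catalog.AddEntry(FilterCatalog.FilterCatalogEntry(name, matcher))


def matching_queries(smiles):
    """Return the sorted names of all catalog entries the molecule matches."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable SMILES match nothing
        return []
    return sorted(e.GetDescription() for e in catalog.GetMatches(mol))
```

For example, `matching_queries("c1ccccc1C(=O)O")` reports both entries, while `matching_queries("CCO")` reports none.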



Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Greg Landrum
Hi Alexis,

I would approach this by loading the 1000 queries into a list of molecules
and then "stream" the others past that (so that you never attempt to load
the full 500K set at once).

Here's a quick sketch of one way to do this:

In [4]: queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]

In [5]: matches = []

In [6]: for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
   ...:     if m is None:
   ...:         continue
   ...:     matches.append([m.HasSubstructMatch(q) for q in queries])
   ...:



Brian has some thoughts on making this particular use case easier/faster
(in particular by adding multi-threading support), so maybe there will be
something in the next release there.

I hope this helps,
-greg


On Sun, Jun 4, 2017 at 10:25 PM, Alexis Parenty <
alexis.parenty.h...@gmail.com> wrote:

> Dear RDKit community,
>
> I need to screen for substructure relationships between two sets of
> structures (1 000 x 500 000). I thought I should build two lists of mol
> objects from SMILES, but I keep getting a memory error when the second
> list reaches 300 000 mols. All my RAM (12G) gets consumed, along with all
> my virtual memory.
>
> Do I really have to compromise on speed and make mol objects on the fly
> from two lists of SMILES? Is there another memory-efficient way to store
> mol objects?
>
> Best,
>
> Alexis