Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
On 2017-06-10 07:42, Chris Swain wrote:
> This sounds like the situation where a database might be a better option, tuned to store fingerprints in RAM?

The issue is how much programming time it will take, how much that time is worth, and how many times the solution will be reused. A clever coding solution could be preferable for other reasons, like a programming exercise. If it's a one-off and you just need it done so you can move on, throwing more hardware at it is often the most cost-effective solution.

Dima
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
This sounds like the situation where a database might be a better option, tuned to store fingerprints in RAM?

Chris

Dr Chris Swain BA MA (Cantab) PhD CChem FRSC
Macs in Chemistry
sw...@mac.com
http://www.macinchem.org
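One way to picture the database suggestion is sketched below. It assumes each structure has a fingerprint designed for substructure screening, so that a query can only be a substructure of a molecule whose fingerprint contains every bit set in the query's fingerprint. The table layout, the function names `build_store` and `screen`, and the 8-bit toy fingerprints are all invented for illustration; with RDKit, `Chem.PatternFingerprint` would be one candidate for producing real fingerprints. Anything that passes the screen is only a candidate and would still need a full `HasSubstructMatch` check.

```python
import sqlite3


def build_store(records):
    """Create an in-memory SQLite table of (smiles, fingerprint) rows.

    `records` is an iterable of (smiles, fingerprint_int) pairs; the
    fingerprints are assumed to have been computed elsewhere.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mols (id INTEGER PRIMARY KEY, smiles TEXT, fp BLOB)"
    )
    rows = (
        # Store each fingerprint as big-endian bytes (at least one byte)
        (smiles, fp.to_bytes(((fp.bit_length() + 7) // 8) or 1, "big"))
        for smiles, fp in records
    )
    conn.executemany("INSERT INTO mols (smiles, fp) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def screen(conn, query_fp):
    """Yield the SMILES of rows whose fingerprint contains every query bit."""
    for smiles, blob in conn.execute("SELECT smiles, fp FROM mols ORDER BY id"):
        mol_fp = int.from_bytes(blob, "big")
        if (mol_fp & query_fp) == query_fp:
            yield smiles


# Toy 8-bit fingerprints, invented purely for illustration
store = build_store(
    [("CCO", 0b00000111), ("c1ccccc1", 0b00111000), ("CCN", 0b00001011)]
)
candidates = list(screen(store, 0b00000011))
```

The screen here still walks every row in Python; its benefit is that only the surviving candidates need to be parsed into mol objects and searched.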
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
On 2017-06-09 08:12, Alexis Parenty wrote:
> Dear Greg and Brian, Many thanks for your response. I was also thinking of your streaming approach! I think the RAM of most machines would deal with lists of 100K mol so we could put the threshold higher than 1000. Actually, I was thinking to monitor the available RAM and only start processing the matrix and clearing the list when less than 20% of RAM is left. This way, the best machines could skip the clearing process and gain time. What do you think?

Take $100, buy a 200GB SSD, set it up as the swap space, don't worry about the RAM.

Dima
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
Yes Greg, this is what I am doing. You’re right, I did not think of building a list of mols from the shorter list and processing each of them against the mols of the longer list (which I would make on the fly from the SMILES). However, I wanted to store the longer list of structures in order to access it again later for new substructure searches, one query structure at a time. It seemed silly to have to rebuild mol objects from a list of 500K SMILES every time I need to do a new substructure search. But your approach is going to help me a lot for the batch-mode search I wanted to do.

Best,

Alexis
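For the repeated single-query searches mentioned above, one option that was not proposed in the thread is to cache a serialized form of each molecule on disk and stream it back when needed. The sketch below shows only the caching mechanics. The names `save_cache` and `load_cache` and the `serialize`/`deserialize` hooks are hypothetical; with RDKit one would presumably pass `lambda m: m.ToBinary()` and `Chem.Mol`. Whether this is actually faster than re-parsing SMILES would need to be measured.

```python
import os
import pickle
import tempfile


def save_cache(path, mols, serialize):
    """Write one serialized blob per molecule to `path`."""
    with open(path, "wb") as handle:
        for mol in mols:
            pickle.dump(serialize(mol), handle)


def load_cache(path, deserialize):
    """Yield molecules one at a time, so the full set is never in memory."""
    with open(path, "rb") as handle:
        while True:
            try:
                blob = pickle.load(handle)
            except EOFError:  # end of the cache file
                return
            yield deserialize(blob)


# Stand-in hooks so the sketch runs without RDKit: "molecules" are strings
with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "mols.cache")
    save_cache(cache_path, ["CCO", "CCN"], serialize=str.encode)
    restored = list(load_cache(cache_path, deserialize=bytes.decode))
```

Because `load_cache` is a generator, a later search can loop over it exactly as it would loop over a supplier.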
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
Hi Alexis,

If I understand your use case correctly, you really don't need this level of complication.

If you are comparing Q molecules to M molecules and M>>Q (in the discussion so far Q = 1 000, M = 500 000) and you only need to compare each of the Qs to each of the Ms a single time, you can safely construct all the Q molecules and store them in memory and then loop over the Ms individually and compare them to each of the Qs (this is what I did in my little sample). This will have more or less exactly the same performance as reading all of the Ms at once and then processing them.

So, on a machine with infinite memory these two snippets will take more or less the same amount of time to execute:

low memory usage:

    from rdkit import Chem

    # Only the queries are held in memory
    queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
    matches = []
    # Each molecule is read, tested against every query, then released
    for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
        if m is None:
            continue
        matches.append([m.HasSubstructMatch(q) for q in queries])

high memory usage:

    from rdkit import Chem

    queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
    # Every molecule is held in memory at once (None entries already filtered out)
    mols = [x for x in Chem.ForwardSDMolSupplier('./znp.50k.sdf') if x is not None]
    matches = []
    for m in mols:
        matches.append([m.HasSubstructMatch(q) for q in queries])

The second form consumes a lot more memory without delivering any improvement in performance.

Best,
-greg
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
What exactly are you doing? Is this 1000x500k substructure queries or something different?

Brian Kelley
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
Dear Greg and Brian,

Many thanks for your responses. I was also thinking of your streaming approach! I think the RAM of most machines could cope with lists of 100K mols, so we could put the threshold higher than 1000. Actually, I was thinking of monitoring the available RAM and only starting to process the matrix and clear the list when less than 20% of RAM is left. This way, the best machines could skip the clearing process and gain time. What do you think?

Best,

Alexis
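The "fill a list, process it, clear it" idea can be sketched with a fixed batch size in place of a RAM probe. Note the swap: checking free memory would need a third-party package such as psutil, so this sketch uses a plain count threshold instead. The names `batched` and `process_in_batches` are hypothetical, and the workload is a stand-in for the real substructure search.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_in_batches(items, batch_size, handle_batch):
    """Call `handle_batch` on each batch and return the number of items seen.

    Earlier batches become garbage once the loop moves on, so memory use
    is bounded by the batch size rather than by the size of the input.
    """
    total = 0
    for batch in batched(items, batch_size):
        handle_batch(batch)
        total += len(batch)
    return total


# Stand-in workload: record batch sizes instead of running searches
sizes = []
processed = process_in_batches(range(10), 4, lambda batch: sizes.append(len(batch)))
```

A larger `batch_size` on a machine with more RAM gives the effect described above without any monitoring logic.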
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
While not multithreaded (yet), this is the use case of the filter catalog:

http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1

Look for the SmartsMatcher class in the blog.

It is a good idea to make this multithreaded as well; I'll add this as a possible enhancement.

Brian Kelley
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
Hi Alexis,

I would approach this by loading the 1000 queries into a list of molecules and then "stream" the others past that (so that you never attempt to load the full 500K set at once).

Here's a quick sketch of one way to do this:

    In [4]: queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]

    In [5]: matches = []

    In [6]: for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
       ...:     if m is None:
       ...:         continue
       ...:     matches.append([m.HasSubstructMatch(q) for q in queries])
       ...:

Brian has some thoughts on making this particular use case easier/faster (in particular by adding multi-threading support), so maybe there will be something in the next release there.

I hope this helps,
-greg
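Since the original question starts from SMILES rather than SD files, the same streaming pattern can be sketched for a text file with one SMILES per line (an assumption about the input format). This variant also yields only the hits, because a dense list of 1 000 booleans for each of 500 000 molecules is 5×10^8 entries and can itself become large. The function names and the stand-in hooks are hypothetical; `rdkit_hooks` shows the RDKit calls one would presumably plug in.

```python
def stream_hits(queries, lines, parse, is_match):
    """Yield (line_number, query_index) for every substructure hit.

    queries  : already-parsed query objects, kept in memory
    lines    : iterable of SMILES strings, e.g. an open text file
    parse    : callable turning a SMILES string into a molecule, or None
    is_match : callable (molecule, query) -> bool
    """
    for line_number, line in enumerate(lines):
        smiles = line.strip()
        if not smiles:  # blank lines are skipped
            continue
        mol = parse(smiles)
        if mol is None:  # unparsable input is skipped
            continue
        for query_index, query in enumerate(queries):
            if is_match(mol, query):
                yield line_number, query_index


def rdkit_hooks():
    """Return (parse, is_match) backed by RDKit; requires RDKit installed."""
    from rdkit import Chem
    return Chem.MolFromSmiles, lambda mol, query: mol.HasSubstructMatch(query)


# Stand-in hooks so the sketch runs without RDKit: "molecules" are plain
# strings and "substructure" is substring containment.
demo_hits = list(
    stream_hits(
        ["CO", "N"],
        ["CCO\n", "\n", "CCN\n", "bad\n"],
        parse=lambda s: None if s == "bad" else s,
        is_match=lambda mol, query: query in mol,
    )
)
```

With RDKit available, `parse, is_match = rdkit_hooks()` would supply the real hooks, the queries would be built with `parse`, and `lines` would be the open SMILES file.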
[Rdkit-discuss] Memory issue when storing more than 300K mol in a list
Dear RDKit community,

I need to screen for substructure relationships between two sets of structures (1 000 x 500 000). I thought I should build two lists of mol objects from SMILES, but I keep getting a memory error when the second list reaches 300 000 mols. All my RAM (12G) gets consumed along with all my virtual memory.

Do I really have to compromise on speed and make mol objects on the fly from two lists of SMILES? Is there another memory-efficient way to store mol objects?

Best,

Alexis