Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
On 2017-06-09 08:12, Alexis Parenty wrote:
> Dear Greg and Brian,
>
> Many thanks for your response. I was also thinking of your streaming approach! I think the RAM of most machines could cope with lists of 100K mols, so we could put the threshold higher than 1000. Actually, I was thinking of monitoring the available RAM and only processing the matrix and clearing the list when less than 20% of RAM is left. This way, the best machines could skip the clearing process and gain time. What do you think?

Take $100, buy a 200GB SSD, set it up as the swap space, don't worry about the RAM.

Dima

--
Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
Yes Greg, this is what I am doing. You're right: I did not think of building a list of mols from the shorter list and comparing each of them with the mols of the longer list, which I would make on the fly from the SMILES. However, I wanted to store the longest list of structures so that I could access it again later for new substructure searches, one query structure at a time. It seemed silly to have to rebuild the mol objects from a list of 500K SMILES every time I need to do a new substructure search. But your approach is going to help me a lot for the batch-mode search I wanted to do.

Best,

Alexis

On 9 June 2017 at 15:42, Greg Landrum wrote:
> Hi Alexis,
>
> If I understand your use case correctly, you really don't need this level of complication.
> [...]
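On keeping the long list around for later single-query searches: a minimal sketch of one option is to hold each molecule in serialized form and rebuild it only when it is needed. RDKit molecules can be serialized with `mol.ToBinary()` and rebuilt with `Chem.Mol(blob)`; whether this saves enough memory and time compared with re-parsing SMILES is an assumption worth measuring on the real data. The class name `PackedStore` and its `dump`/`load` parameters are hypothetical, and the toy stand-ins below (plain strings as "molecules") are only there so the sketch runs without RDKit.

```python
from typing import Callable, Iterable, Iterator, List


class PackedStore:
    """Hold objects in serialized form and rebuild them one at a time."""

    def __init__(self, dump: Callable[[object], bytes], load: Callable[[bytes], object]):
        self._dump = dump
        self._load = load
        self._blobs: List[bytes] = []

    def extend(self, objects: Iterable[object]) -> None:
        # Only the serialized bytes are retained; the live objects can be freed.
        self._blobs.extend(self._dump(obj) for obj in objects)

    def __len__(self) -> int:
        return len(self._blobs)

    def __iter__(self) -> Iterator[object]:
        # Rebuild lazily so that only one live object exists at a time.
        for blob in self._blobs:
            yield self._load(blob)


# Toy stand-ins: strings play the role of molecules, substring
# containment plays the role of a substructure match.
store = PackedStore(dump=str.encode, load=bytes.decode)
store.extend(["CCO", "c1ccccc1"])
hits = [mol for mol in store if "cc" in mol]
```

With RDKit the store would presumably be created as `PackedStore(dump=lambda m: m.ToBinary(), load=Chem.Mol)` and filled from a supplier, so that a later search only loops over the store and calls `HasSubstructMatch` on each rebuilt molecule.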
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
Hi Alexis,

If I understand your use case correctly, you really don't need this level of complication.

If you are comparing Q molecules to M molecules and M >> Q (in the discussion so far Q = 1000, M = 500 000) and you only need to compare each of the Qs to each of the Ms a single time, you can safely construct all the Q molecules and store them in memory and then loop over the Ms individually and compare them to each of the Qs (this is what I did in my little sample). This will have more or less exactly the same performance as reading all of the Ms at once and then processing them.

So, on a machine with infinite memory these two snippets will take more or less the same amount of time to execute:

low memory usage:

    from rdkit import Chem

    queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
    matches = []
    # Targets are read one at a time and never collected in a list
    for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
        if m is None:
            continue
        matches.append([m.HasSubstructMatch(q) for q in queries])

high memory usage:

    from rdkit import Chem

    queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
    # All targets are built and kept in memory before any comparison starts
    mols = [x for x in Chem.ForwardSDMolSupplier('./znp.50k.sdf') if x is not None]
    matches = []
    for m in mols:
        matches.append([m.HasSubstructMatch(q) for q in queries])

The second form consumes a lot more memory without delivering any improvement in performance.

Best,
-greg

On Fri, Jun 9, 2017 at 3:33 PM, Alexis Parenty <alexis.parenty.h...@gmail.com> wrote:
> Hi again, FYI here is the memory monitoring in attachment. Thanks,
>
> Alexis
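A point the snippets above leave implicit is the size of `matches` itself. A list of lists of Python bools costs at least one pointer per entry, so for 500 000 targets and 1 000 queries the result alone is on the order of gigabytes on 64-bit CPython. The sketch below shows that arithmetic and one possible alternative: keeping only the indices of the queries that matched. The helper names `hit_indices` and `dense_pointer_bytes` are hypothetical, and the saving assumes that most query/target pairs do not match.

```python
from typing import Iterable, List


def hit_indices(row: Iterable[bool]) -> List[int]:
    """Return the positions of the True entries in one row of match flags."""
    return [i for i, flag in enumerate(row) if flag]


def dense_pointer_bytes(n_targets: int, n_queries: int, pointer_size: int = 8) -> int:
    """Rough lower bound for a list of lists of bools: one pointer per entry."""
    return n_targets * n_queries * pointer_size


# One row of flags as produced by the list comprehension in the snippets above.
row = [False, True, False, False, True]
sparse = hit_indices(row)

# Lower bound for the full 500 000 x 1 000 matrix, in gigabytes.
estimate_gb = dense_pointer_bytes(500_000, 1_000) / 1e9
```

In the streaming loop this would amount to appending `hit_indices(...)` instead of the full row of flags, or writing each row to disk as soon as it is computed.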
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
What exactly are you doing? Is this 1000x500k substructure queries or something different?

Brian Kelley

On Jun 9, 2017, at 9:12 AM, Alexis Parenty wrote:
> Dear Greg and Brian,
>
> Many thanks for your response. I was also thinking of your streaming approach! I think the RAM of most machines could cope with lists of 100K mols, so we could put the threshold higher than 1000. Actually, I was thinking of monitoring the available RAM and only processing the matrix and clearing the list when less than 20% of RAM is left. This way, the best machines could skip the clearing process and gain time. What do you think?
>
> Best,
>
> Alexis
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
Dear Greg and Brian,

Many thanks for your response. I was also thinking of your streaming approach! I think the RAM of most machines could cope with lists of 100K mols, so we could put the threshold higher than 1000. Actually, I was thinking of monitoring the available RAM and only processing the matrix and clearing the list when less than 20% of RAM is left. This way, the best machines could skip the clearing process and gain time. What do you think?

Best,

Alexis

On 9 June 2017 at 14:40, Brian Kelley wrote:
> While not multithreaded (yet) this is the use case of the filter catalog:
>
> http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1
>
> Look for the SmartsMatcher class in the blog.
>
> It is a good idea to make this multithreaded as well, I'll add this as a possible enhancement.
>
> Brian Kelley
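The threshold idea can be sketched with fixed-size batches: build at most `size` molecules, process them, let the list go, and repeat. This is a simpler stand-in for the free-RAM check described above, not the same thing; reading the available memory would need something outside the standard library (the third-party `psutil` package is one option). The helper name `batched` and the batch size are assumptions for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Each chunk is processed and then released before the next one is built,
# so peak memory is bounded by `size`, not by the length of the input.
chunk_sizes = [len(chunk) for chunk in batched(range(10), 4)]
```

Because `batched` accepts any iterable, it could wrap a molecule supplier or a generator that parses SMILES, with `size` set to whatever a given machine handles comfortably (100K in the message above).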
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
While not multithreaded (yet) this is the use case of the filter catalog:

http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1

Look for the SmartsMatcher class in the blog.

It is a good idea to make this multithreaded as well, I'll add this as a possible enhancement.

Brian Kelley

On Jun 9, 2017, at 7:04 AM, Greg Landrum wrote:
> Hi Alexis,
>
> I would approach this by loading the 1000 queries into a list of molecules and then "stream" the others past that (so that you never attempt to load the full 500K set at once).
> [...]
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
Hi Alexis,

I would approach this by loading the 1000 queries into a list of molecules and then "stream" the others past that (so that you never attempt to load the full 500K set at once).

Here's a quick sketch of one way to do this:

    from rdkit import Chem

    # Load the 1000 queries once; skip records that failed to parse
    queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]

    matches = []
    # Stream the larger set past the queries one molecule at a time
    for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
        if m is None:
            continue
        matches.append([m.HasSubstructMatch(q) for q in queries])

Brian has some thoughts on making this particular use case easier/faster (in particular by adding multi-threading support), so maybe there will be something in the next release there.

I hope this helps,
-greg

On Sun, Jun 4, 2017 at 10:25 PM, Alexis Parenty <alexis.parenty.h...@gmail.com> wrote:
> Dear RDKit community,
>
> I need to screen for substructure relationships between two sets of structures (1 000 X 500 000): I thought I should build two lists of mol objects from SMILES, but I keep getting a memory error when the second list reaches 300 000 mols. All my RAM (12G) gets consumed along with all my virtual memory.
>
> Do I really have to compromise on speed and make mol objects on the fly from two lists of SMILES? Is there another memory-efficient way to store mol objects?
>
> Best,
>
> Alexis
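The sketch above reads SD files, while the original question starts from SMILES. A minimal version of the same streaming idea for that case might look as follows. The function name `stream_matches` and its `parse` and `is_match` parameters are hypothetical; with RDKit one would presumably pass `Chem.MolFromSmiles` and `lambda m, q: m.HasSubstructMatch(q)`. The toy stand-ins keep the example runnable without RDKit installed.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def stream_matches(
    queries: List[object],
    target_smiles: Iterable[str],
    parse: Callable[[str], object],
    is_match: Callable[[object, object], bool],
) -> Iterator[Tuple[str, List[bool]]]:
    """Yield (smiles, row) pairs one target at a time.

    Only the queries stay in memory; each target object is built,
    compared against every query, and then dropped.
    """
    for smi in target_smiles:
        smi = smi.strip()
        if not smi:
            continue  # skip blank lines
        mol = parse(smi)
        if mol is None:
            continue  # unparsable input, like the `m is None` check above
        yield smi, [is_match(mol, q) for q in queries]


# Toy stand-ins: "molecules" are plain strings and a "substructure match"
# is substring containment.
def toy_parse(smi: str):
    return None if smi == "bad" else smi


def toy_match(mol: str, query: str) -> bool:
    return query in mol


rows = list(stream_matches(["CC", "N"], ["CCO", "bad", "", "CN"], toy_parse, toy_match))
```

Since `target_smiles` can be any iterable of strings, an open file with one SMILES per line can be passed directly, and the 500K targets are never held in memory at once.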