Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity
Thanks for the reference. That sort of bounds screening would probably work well in the C++ layer for the bulk similarity functions. My initial experiments without bounds screening found that doing individual similarity calculations in Python was a lot slower than the bulk function because moving from the Python to the C++ layer has quite a large overhead. It might be worth doing the screening in Python and I will try it but I suspect that the overhead will cancel out any gain brought by the screening process. On Wed, 26 Oct 2022 at 04:17, S Joshua Swamidass wrote: > > I wonder if there is a way to make use of PyTorch or tensorflow to do this > on a GPU. That’s where some big speed ups might be found. > > Also, consider using these bounds. They do make a big difference in many > cases. > > https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2527184/ > > > On Tue, Oct 25, 2022 at 8:57 PM Francois Berenger > wrote: > >> On 24/10/2022 19:47, David Cosgrove wrote: >> > For the record, I have attempted this, but got only a marginal >> > speed-up (130% of CPU used, with any number of threads above 2). The >> > procedure I used was to extract the fingerprint pointers into a >> > std::vector, create a std::vector for the results, unlock the GIL to >> > do the bulk tanimoto calculation, then re-lock the GIL to copy the >> > results from the std::vector into the python:list for output. I guess >> > the extra overhead to create and populate the additional std::vectors >> > destroyed any potential speedup. This was on a vector of 200K >> > fingerprints, which suggests that the Tanimoto calculation is a small >> > part of the overall time. It doesn't seem worth pursuing further. >> >> There is probably code on github doing this in parallel already. >> Think about it: any clustering algorithm using a distance matrix. >> I guess many people want to initialize the Gram matrix in parallel. >> >> I wouldn't be surprised if, for example, chemfp has such code. >> >> > Dave >> > >> > On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove >> > wrote: >> > >> >> Hi Greg, >> >> Thanks for the pointer. I’ll take a look. If it could go in the >> >> next patch release that would be really useful. >> >> Dave >> >> >> >> On Sat, 22 Oct 2022 at 10:52, Greg Landrum >> >> wrote: >> >> >> >> Hi Dave, >> >> >> >> We have multiple examples of this in the code, here’s one: >> >> >> >> >> > >> https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40 >> >> >> >> I’m not sure how this would interact with the call to >> >> Python::extract that’s in the bulk functions though >> >> >> >> It might be better to handle the multithreading on the C++ side by >> >> adding an optional nThreads argument to the bulk similarity >> >> functions. (Though this would have to wait for the next release >> >> since it’s a feature addition… we can declare releasing the GIL >> >> as a bug fix) >> >> >> >> -greg >> >> >> >> On Sat, 22 Oct 2022 at 09:48, David Cosgrove >> >> wrote: >> >> >> >> Hi, >> >> >> >> I'm doing a lot of tanimoto similarity calculations on large >> >> datasets using BulkTanimotoSimilarity. It is an obvious candidate >> >> for parallelisation, so I am using concurrent.futures to do so. If >> >> I use ProcessPoolExectuor, I get good speed-up but each process >> >> needs a copy of the fingerprint set and for the sizes I'm dealing >> >> with that uses too much memory. With ThreadPoolExecutor I only need >> >> 1 copy of the fingerprints, but the GIL means it only runs on 1 >> >> thread at a time so there's no gain. Would it be possible to amend >> >> the C++ BulkTanimotoSimilarity to free the GIL whilst it's doing the >> >> calculation, and recapture it afterwards? I understand things like >> >> numpy do this for some of their functions. I'm happy to attempt it >> >> myself if someone who knows about these things can advise that it >> >> could be done, it would help, and they could provide a few pointers. >> >> >> >> Thanks, >> >> Dave >> >> >> >> -- >> >> >> >> David Cosgrove >> >> Freelance computational chemistry and chemoinformatics developer >> >> http://cozchemix.co.uk >> >> >> >> ___ >> >> Rdkit-discuss mailing list >> >> Rdkit-discuss@lists.sourceforge.net >> >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >> > -- >> > >> > David Cosgrove >> > Freelance computational chemistry and chemoinformatics developer >> > http://cozchemix.co.uk >> > >> > -- >> > >> > David Cosgrove >> > Freelance computational chemistry and chemoinformatics developer >> > http://cozchemix.co.uk >> > ___ >> > Rdkit-discuss mailing list >> > Rdkit-discuss@lists.sourceforge.net >> > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >> >> >> ___ >> Rdkit-discuss mailing list >> Rdkit-discuss@lists.sourceforge.net >>
Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity
I would be very surprised if speed of fingerprint similarity was the limiting factor on a distance- matrix-based clustering method. Normally they are constrained by memory requirements. In this case I am using the MaxMin picker in RDKit to generate the cluster “centroids” and am wanting to fill the clusters with the neighbours to each “centroid”. That is obviously an embarrassingly parallel calculation, but the size of my dataset means that using standard Python multi-processing takes up too much memory because each parallel process needs its own copy of the fingerprint set. I was wondering if releasing the GIL would make it possible to do it in multithreaded mode with a single shared fingerprint set but the answer was that it made little improvement on a single-thread run. On Wed, 26 Oct 2022 at 02:32, Francois Berenger wrote: > On 24/10/2022 19:47, David Cosgrove wrote: > > For the record, I have attempted this, but got only a marginal > > speed-up (130% of CPU used, with any number of threads above 2). The > > procedure I used was to extract the fingerprint pointers into a > > std::vector, create a std::vector for the results, unlock the GIL to > > do the bulk tanimoto calculation, then re-lock the GIL to copy the > > results from the std::vector into the python:list for output. I guess > > the extra overhead to create and populate the additional std::vectors > > destroyed any potential speedup. This was on a vector of 200K > > fingerprints, which suggests that the Tanimoto calculation is a small > > part of the overall time. It doesn't seem worth pursuing further. > > There is probably code on github doing this in parallel already. > Think about it: any clustering algorithm using a distance matrix. > I guess many people want to initialize the Gram matrix in parallel. > > I wouldn't be surprised if, for example, chemfp has such code. > > > Dave > > > > On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove > > wrote: > > > >> Hi Greg, > >> Thanks for the pointer. I’ll take a look. If it could go in the > >> next patch release that would be really useful. > >> Dave > >> > >> On Sat, 22 Oct 2022 at 10:52, Greg Landrum > >> wrote: > >> > >> Hi Dave, > >> > >> We have multiple examples of this in the code, here’s one: > >> > >> > > > https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40 > >> > >> I’m not sure how this would interact with the call to > >> Python::extract that’s in the bulk functions though > >> > >> It might be better to handle the multithreading on the C++ side by > >> adding an optional nThreads argument to the bulk similarity > >> functions. (Though this would have to wait for the next release > >> since it’s a feature addition… we can declare releasing the GIL > >> as a bug fix) > >> > >> -greg > >> > >> On Sat, 22 Oct 2022 at 09:48, David Cosgrove > >> wrote: > >> > >> Hi, > >> > >> I'm doing a lot of tanimoto similarity calculations on large > >> datasets using BulkTanimotoSimilarity. It is an obvious candidate > >> for parallelisation, so I am using concurrent.futures to do so. If > >> I use ProcessPoolExectuor, I get good speed-up but each process > >> needs a copy of the fingerprint set and for the sizes I'm dealing > >> with that uses too much memory. With ThreadPoolExecutor I only need > >> 1 copy of the fingerprints, but the GIL means it only runs on 1 > >> thread at a time so there's no gain. Would it be possible to amend > >> the C++ BulkTanimotoSimilarity to free the GIL whilst it's doing the > >> calculation, and recapture it afterwards? I understand things like > >> numpy do this for some of their functions. I'm happy to attempt it > >> myself if someone who knows about these things can advise that it > >> could be done, it would help, and they could provide a few pointers. > >> > >> Thanks, > >> Dave > >> > >> -- > >> > >> David Cosgrove > >> Freelance computational chemistry and chemoinformatics developer > >> http://cozchemix.co.uk > >> > >> ___ > >> Rdkit-discuss mailing list > >> Rdkit-discuss@lists.sourceforge.net > >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > > -- > > > > David Cosgrove > > Freelance computational chemistry and chemoinformatics developer > > http://cozchemix.co.uk > > > > -- > > > > David Cosgrove > > Freelance computational chemistry and chemoinformatics developer > > http://cozchemix.co.uk > > ___ > > Rdkit-discuss mailing list > > Rdkit-discuss@lists.sourceforge.net > > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > -- David Cosgrove Freelance computational chemistry and chemoinformatics developer http://cozchemix.co.uk ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity
I wonder if there is a way to make use of PyTorch or tensorflow to do this on a GPU. That’s where some big speed ups might be found. Also, consider using these bounds. They do make a big difference in many cases. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2527184/ On Tue, Oct 25, 2022 at 8:57 PM Francois Berenger wrote: > On 24/10/2022 19:47, David Cosgrove wrote: > > For the record, I have attempted this, but got only a marginal > > speed-up (130% of CPU used, with any number of threads above 2). The > > procedure I used was to extract the fingerprint pointers into a > > std::vector, create a std::vector for the results, unlock the GIL to > > do the bulk tanimoto calculation, then re-lock the GIL to copy the > > results from the std::vector into the python:list for output. I guess > > the extra overhead to create and populate the additional std::vectors > > destroyed any potential speedup. This was on a vector of 200K > > fingerprints, which suggests that the Tanimoto calculation is a small > > part of the overall time. It doesn't seem worth pursuing further. > > There is probably code on github doing this in parallel already. > Think about it: any clustering algorithm using a distance matrix. > I guess many people want to initialize the Gram matrix in parallel. > > I wouldn't be surprised if, for example, chemfp has such code. > > > Dave > > > > On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove > > wrote: > > > >> Hi Greg, > >> Thanks for the pointer. I’ll take a look. If it could go in the > >> next patch release that would be really useful. > >> Dave > >> > >> On Sat, 22 Oct 2022 at 10:52, Greg Landrum > >> wrote: > >> > >> Hi Dave, > >> > >> We have multiple examples of this in the code, here’s one: > >> > >> > > > https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40 > >> > >> I’m not sure how this would interact with the call to > >> Python::extract that’s in the bulk functions though > >> > >> It might be better to handle the multithreading on the C++ side by > >> adding an optional nThreads argument to the bulk similarity > >> functions. (Though this would have to wait for the next release > >> since it’s a feature addition… we can declare releasing the GIL > >> as a bug fix) > >> > >> -greg > >> > >> On Sat, 22 Oct 2022 at 09:48, David Cosgrove > >> wrote: > >> > >> Hi, > >> > >> I'm doing a lot of tanimoto similarity calculations on large > >> datasets using BulkTanimotoSimilarity. It is an obvious candidate > >> for parallelisation, so I am using concurrent.futures to do so. If > >> I use ProcessPoolExectuor, I get good speed-up but each process > >> needs a copy of the fingerprint set and for the sizes I'm dealing > >> with that uses too much memory. With ThreadPoolExecutor I only need > >> 1 copy of the fingerprints, but the GIL means it only runs on 1 > >> thread at a time so there's no gain. Would it be possible to amend > >> the C++ BulkTanimotoSimilarity to free the GIL whilst it's doing the > >> calculation, and recapture it afterwards? I understand things like > >> numpy do this for some of their functions. I'm happy to attempt it > >> myself if someone who knows about these things can advise that it > >> could be done, it would help, and they could provide a few pointers. > >> > >> Thanks, > >> Dave > >> > >> -- > >> > >> David Cosgrove > >> Freelance computational chemistry and chemoinformatics developer > >> http://cozchemix.co.uk > >> > >> ___ > >> Rdkit-discuss mailing list > >> Rdkit-discuss@lists.sourceforge.net > >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > > -- > > > > David Cosgrove > > Freelance computational chemistry and chemoinformatics developer > > http://cozchemix.co.uk > > > > -- > > > > David Cosgrove > > Freelance computational chemistry and chemoinformatics developer > > http://cozchemix.co.uk > > ___ > > Rdkit-discuss mailing list > > Rdkit-discuss@lists.sourceforge.net > > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > > > ___ > Rdkit-discuss mailing list > Rdkit-discuss@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > -- Sent from Gmail Mobile ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity
On 24/10/2022 19:47, David Cosgrove wrote: For the record, I have attempted this, but got only a marginal speed-up (130% of CPU used, with any number of threads above 2). The procedure I used was to extract the fingerprint pointers into a std::vector, create a std::vector for the results, unlock the GIL to do the bulk tanimoto calculation, then re-lock the GIL to copy the results from the std::vector into the python:list for output. I guess the extra overhead to create and populate the additional std::vectors destroyed any potential speedup. This was on a vector of 200K fingerprints, which suggests that the Tanimoto calculation is a small part of the overall time. It doesn't seem worth pursuing further. There is probably code on github doing this in parallel already. Think about it: any clustering algorithm using a distance matrix. I guess many people want to initialize the Gram matrix in parallel. I wouldn't be surprised if, for example, chemfp has such code. Dave On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove wrote: Hi Greg, Thanks for the pointer. I’ll take a look. If it could go in the next patch release that would be really useful. Dave On Sat, 22 Oct 2022 at 10:52, Greg Landrum wrote: Hi Dave, We have multiple examples of this in the code, here’s one: https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40 I’m not sure how this would interact with the call to Python::extract that’s in the bulk functions though It might be better to handle the multithreading on the C++ side by adding an optional nThreads argument to the bulk similarity functions. (Though this would have to wait for the next release since it’s a feature addition… we can declare releasing the GIL as a bug fix) -greg On Sat, 22 Oct 2022 at 09:48, David Cosgrove wrote: Hi, I'm doing a lot of tanimoto similarity calculations on large datasets using BulkTanimotoSimilarity. It is an obvious candidate for parallelisation, so I am using concurrent.futures to do so. If I use ProcessPoolExectuor, I get good speed-up but each process needs a copy of the fingerprint set and for the sizes I'm dealing with that uses too much memory. With ThreadPoolExecutor I only need 1 copy of the fingerprints, but the GIL means it only runs on 1 thread at a time so there's no gain. Would it be possible to amend the C++ BulkTanimotoSimilarity to free the GIL whilst it's doing the calculation, and recapture it afterwards? I understand things like numpy do this for some of their functions. I'm happy to attempt it myself if someone who knows about these things can advise that it could be done, it would help, and they could provide a few pointers. Thanks, Dave -- David Cosgrove Freelance computational chemistry and chemoinformatics developer http://cozchemix.co.uk ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss -- David Cosgrove Freelance computational chemistry and chemoinformatics developer http://cozchemix.co.uk -- David Cosgrove Freelance computational chemistry and chemoinformatics developer http://cozchemix.co.uk ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity
For the record, I have attempted this, but got only a marginal speed-up (130% of CPU used, with any number of threads above 2). The procedure I used was to extract the fingerprint pointers into a std::vector, create a std::vector for the results, unlock the GIL to do the bulk tanimoto calculation, then re-lock the GIL to copy the results from the std::vector into the python:list for output. I guess the extra overhead to create and populate the additional std::vectors destroyed any potential speedup. This was on a vector of 200K fingerprints, which suggests that the Tanimoto calculation is a small part of the overall time. It doesn't seem worth pursuing further. Dave On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove wrote: > Hi Greg, > Thanks for the pointer. I’ll take a look. If it could go in the next patch > release that would be really useful. > Dave > > > On Sat, 22 Oct 2022 at 10:52, Greg Landrum wrote: > >> >> Hi Dave, >> >> We have multiple examples of this in the code, here’s one: >> >> https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40 >> >> I’m not sure how this would interact with the call to Python::extract >> that’s in the bulk functions though >> >> It might be better to handle the multithreading on the C++ side by adding >> an optional nThreads argument to the bulk similarity functions. (Though >> this would have to wait for the next release since it’s a feature addition… >> we can declare releasing the GIL as a bug fix) >> >> -greg >> >> >> On Sat, 22 Oct 2022 at 09:48, David Cosgrove >> wrote: >> >>> Hi, >>> >>> I'm doing a lot of tanimoto similarity calculations on large datasets >>> using BulkTanimotoSimilarity. It is an obvious candidate for >>> parallelisation, so I am using concurrent.futures to do so. If I use >>> ProcessPoolExectuor, I get good speed-up but each process needs a copy of >>> the fingerprint set and for the sizes I'm dealing with that uses too much >>> memory. With ThreadPoolExecutor I only need 1 copy of the fingerprints, >>> but the GIL means it only runs on 1 thread at a time so there's no gain. >>> Would it be possible to amend the C++ BulkTanimotoSimilarity to free the >>> GIL whilst it's doing the calculation, and recapture it afterwards? I >>> understand things like numpy do this for some of their functions. I'm >>> happy to attempt it myself if someone who knows about these things can >>> advise that it could be done, it would help, and they could provide a few >>> pointers. >>> >>> Thanks, >>> Dave >>> >>> >>> -- >>> David Cosgrove >>> Freelance computational chemistry and chemoinformatics developer >>> http://cozchemix.co.uk >>> >>> ___ >>> Rdkit-discuss mailing list >>> Rdkit-discuss@lists.sourceforge.net >>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >>> >> -- > David Cosgrove > Freelance computational chemistry and chemoinformatics developer > http://cozchemix.co.uk > > -- David Cosgrove Freelance computational chemistry and chemoinformatics developer http://cozchemix.co.uk ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity
Hi Greg, Thanks for the pointer. I’ll take a look. If it could go in the next patch release that would be really useful. Dave On Sat, 22 Oct 2022 at 10:52, Greg Landrum wrote: > > Hi Dave, > > We have multiple examples of this in the code, here’s one: > > https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40 > > I’m not sure how this would interact with the call to Python::extract > that’s in the bulk functions though > > It might be better to handle the multithreading on the C++ side by adding > an optional nThreads argument to the bulk similarity functions. (Though > this would have to wait for the next release since it’s a feature addition… > we can declare releasing the GIL as a bug fix) > > -greg > > > On Sat, 22 Oct 2022 at 09:48, David Cosgrove > wrote: > >> Hi, >> >> I'm doing a lot of tanimoto similarity calculations on large datasets >> using BulkTanimotoSimilarity. It is an obvious candidate for >> parallelisation, so I am using concurrent.futures to do so. If I use >> ProcessPoolExectuor, I get good speed-up but each process needs a copy of >> the fingerprint set and for the sizes I'm dealing with that uses too much >> memory. With ThreadPoolExecutor I only need 1 copy of the fingerprints, >> but the GIL means it only runs on 1 thread at a time so there's no gain. >> Would it be possible to amend the C++ BulkTanimotoSimilarity to free the >> GIL whilst it's doing the calculation, and recapture it afterwards? I >> understand things like numpy do this for some of their functions. I'm >> happy to attempt it myself if someone who knows about these things can >> advise that it could be done, it would help, and they could provide a few >> pointers. >> >> Thanks, >> Dave >> >> >> -- >> David Cosgrove >> Freelance computational chemistry and chemoinformatics developer >> http://cozchemix.co.uk >> >> ___ >> Rdkit-discuss mailing list >> Rdkit-discuss@lists.sourceforge.net >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >> > -- David Cosgrove Freelance computational chemistry and chemoinformatics developer http://cozchemix.co.uk ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity
Hi Dave, We have multiple examples of this in the code, here’s one: https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40 I’m not sure how this would interact with the call to Python::extract that’s in the bulk functions though It might be better to handle the multithreading on the C++ side by adding an optional nThreads argument to the bulk similarity functions. (Though this would have to wait for the next release since it’s a feature addition… we can declare releasing the GIL as a bug fix) -greg On Sat, 22 Oct 2022 at 09:48, David Cosgrove wrote: > Hi, > > I'm doing a lot of tanimoto similarity calculations on large datasets > using BulkTanimotoSimilarity. It is an obvious candidate for > parallelisation, so I am using concurrent.futures to do so. If I use > ProcessPoolExectuor, I get good speed-up but each process needs a copy of > the fingerprint set and for the sizes I'm dealing with that uses too much > memory. With ThreadPoolExecutor I only need 1 copy of the fingerprints, > but the GIL means it only runs on 1 thread at a time so there's no gain. > Would it be possible to amend the C++ BulkTanimotoSimilarity to free the > GIL whilst it's doing the calculation, and recapture it afterwards? I > understand things like numpy do this for some of their functions. I'm > happy to attempt it myself if someone who knows about these things can > advise that it could be done, it would help, and they could provide a few > pointers. > > Thanks, > Dave > > > -- > David Cosgrove > Freelance computational chemistry and chemoinformatics developer > http://cozchemix.co.uk > > ___ > Rdkit-discuss mailing list > Rdkit-discuss@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity
Hi, I'm doing a lot of tanimoto similarity calculations on large datasets using BulkTanimotoSimilarity. It is an obvious candidate for parallelisation, so I am using concurrent.futures to do so. If I use ProcessPoolExectuor, I get good speed-up but each process needs a copy of the fingerprint set and for the sizes I'm dealing with that uses too much memory. With ThreadPoolExecutor I only need 1 copy of the fingerprints, but the GIL means it only runs on 1 thread at a time so there's no gain. Would it be possible to amend the C++ BulkTanimotoSimilarity to free the GIL whilst it's doing the calculation, and recapture it afterwards? I understand things like numpy do this for some of their functions. I'm happy to attempt it myself if someone who knows about these things can advise that it could be done, it would help, and they could provide a few pointers. Thanks, Dave -- David Cosgrove Freelance computational chemistry and chemoinformatics developer http://cozchemix.co.uk ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss