Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity

2022-10-26 Thread David Cosgrove
Thanks for the reference. That sort of bounds screening would probably work
well in the C++ layer for the bulk similarity functions.  My initial
experiments without bounds screening found that doing individual similarity
calculations in Python was a lot slower than the bulk function because
moving from the Python to the C++ layer has quite a large overhead. It
might be worth doing the screening in Python and I will try it but I
suspect that the overhead will cancel out any gain brought by the screening
process.

On Wed, 26 Oct 2022 at 04:17, S Joshua Swamidass 
wrote:

>
> I wonder if there is a way to make use of PyTorch or tensorflow to do this
> on a GPU. That’s where some big speed ups might be found.
>
> Also, consider using these bounds. They do make a big difference in many
> cases.
>
> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2527184/
>
>
> On Tue, Oct 25, 2022 at 8:57 PM Francois Berenger 
> wrote:
>
>> On 24/10/2022 19:47, David Cosgrove wrote:
>> > For the record, I have attempted this, but got only a marginal
>> > speed-up (130% of CPU used, with any number of threads above 2).  The
>> > procedure I used was to extract the fingerprint pointers into a
>> > std::vector, create a std::vector for the results, unlock the GIL to
>> > do the bulk tanimoto calculation, then re-lock the GIL to copy the
>> > results from the std::vector into the python:list for output.  I guess
>> > the extra overhead to create and populate the additional std::vectors
>> > destroyed any potential speedup.  This was on a vector of 200K
>> > fingerprints, which suggests that the Tanimoto calculation is a small
>> > part of the overall time.  It doesn't seem worth pursuing further.
>>
>> There is probably code on github doing this in parallel already.
>> Think about it: any clustering algorithm using a distance matrix.
>> I guess many people want to initialize the Gram matrix in parallel.
>>
>> I wouldn't be surprised if, for example, chemfp has such code.
>>
>> > Dave
>> >
>> > On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove
>> >  wrote:
>> >
>> >> Hi Greg,
>> >> Thanks for the pointer. I’ll take a look. If it could go in the
>> >> next patch release that would be really useful.
>> >> Dave
>> >>
>> >> On Sat, 22 Oct 2022 at 10:52, Greg Landrum 
>> >> wrote:
>> >>
>> >> Hi Dave,
>> >>
>> >> We have multiple examples of this in the code, here’s one:
>> >>
>> >>
>> >
>> https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40
>> >>
>> >> I’m not sure how this would interact with the call to
>> >> Python::extract that’s in the bulk functions though
>> >>
>> >> It might be better to handle the multithreading on the C++ side by
>> >> adding an optional nThreads argument to  the bulk similarity
>> >> functions. (Though this would have to wait for the next release
>> >> since it’s a feature addition… we can declare releasing the GIL
>> >> as a bug fix)
>> >>
>> >> -greg
>> >>
>> >> On Sat, 22 Oct 2022 at 09:48, David Cosgrove
>> >>  wrote:
>> >>
>> >> Hi,
>> >>
>> >> I'm doing a lot of tanimoto similarity calculations on large
>> >> datasets using BulkTanimotoSimilarity.  It is an obvious candidate
>> >> for parallelisation, so I am using concurrent.futures to do so.  If
>> >> I use ProcessPoolExectuor, I get good speed-up but each process
>> >> needs a copy of the fingerprint set and for the sizes I'm dealing
>> >> with that uses too much memory.  With ThreadPoolExecutor I only need
>> >> 1 copy of the fingerprints, but the GIL means it only runs on 1
>> >> thread at a time so there's no gain.  Would it be possible to amend
>> >> the C++ BulkTanimotoSimilarity to free the GIL whilst it's doing the
>> >> calculation, and recapture it afterwards?  I understand things like
>> >> numpy do this for some of their functions.  I'm happy to attempt it
>> >> myself if someone who knows about these things can advise that it
>> >> could be done, it would help, and they could provide a few pointers.
>> >>
>> >> Thanks,
>> >> Dave
>> >>
>> >> --
>> >>
>> >> David Cosgrove
>> >> Freelance computational chemistry and chemoinformatics developer
>> >> http://cozchemix.co.uk
>> >>
>> >> ___
>> >> Rdkit-discuss mailing list
>> >> Rdkit-discuss@lists.sourceforge.net
>> >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>> >  --
>> >
>> > David Cosgrove
>> > Freelance computational chemistry and chemoinformatics developer
>> > http://cozchemix.co.uk
>> >
>> > --
>> >
>> > David Cosgrove
>> > Freelance computational chemistry and chemoinformatics developer
>> > http://cozchemix.co.uk
>> > ___
>> > Rdkit-discuss mailing list
>> > Rdkit-discuss@lists.sourceforge.net
>> > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> 

Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity

2022-10-26 Thread David Cosgrove
I would be very surprised if speed of fingerprint similarity was the
limiting factor on a distance- matrix-based clustering method. Normally
they are constrained by memory requirements.  In this case I am using the
MaxMin picker in RDKit to generate the cluster “centroids” and am wanting
to fill the clusters with the neighbours to each “centroid”. That is
obviously an embarrassingly parallel calculation, but the size of my
dataset means that using standard Python multi-processing takes up too much
memory because each parallel process needs its own copy of the fingerprint
set. I was wondering if releasing the GIL would make it possible to do it
in multithreaded mode with a single shared fingerprint set but the answer
was that it made little improvement on a single-thread run.


On Wed, 26 Oct 2022 at 02:32, Francois Berenger  wrote:

> On 24/10/2022 19:47, David Cosgrove wrote:
> > For the record, I have attempted this, but got only a marginal
> > speed-up (130% of CPU used, with any number of threads above 2).  The
> > procedure I used was to extract the fingerprint pointers into a
> > std::vector, create a std::vector for the results, unlock the GIL to
> > do the bulk tanimoto calculation, then re-lock the GIL to copy the
> > results from the std::vector into the python:list for output.  I guess
> > the extra overhead to create and populate the additional std::vectors
> > destroyed any potential speedup.  This was on a vector of 200K
> > fingerprints, which suggests that the Tanimoto calculation is a small
> > part of the overall time.  It doesn't seem worth pursuing further.
>
> There is probably code on github doing this in parallel already.
> Think about it: any clustering algorithm using a distance matrix.
> I guess many people want to initialize the Gram matrix in parallel.
>
> I wouldn't be surprised if, for example, chemfp has such code.
>
> > Dave
> >
> > On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove
> >  wrote:
> >
> >> Hi Greg,
> >> Thanks for the pointer. I’ll take a look. If it could go in the
> >> next patch release that would be really useful.
> >> Dave
> >>
> >> On Sat, 22 Oct 2022 at 10:52, Greg Landrum 
> >> wrote:
> >>
> >> Hi Dave,
> >>
> >> We have multiple examples of this in the code, here’s one:
> >>
> >>
> >
> https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40
> >>
> >> I’m not sure how this would interact with the call to
> >> Python::extract that’s in the bulk functions though
> >>
> >> It might be better to handle the multithreading on the C++ side by
> >> adding an optional nThreads argument to  the bulk similarity
> >> functions. (Though this would have to wait for the next release
> >> since it’s a feature addition… we can declare releasing the GIL
> >> as a bug fix)
> >>
> >> -greg
> >>
> >> On Sat, 22 Oct 2022 at 09:48, David Cosgrove
> >>  wrote:
> >>
> >> Hi,
> >>
> >> I'm doing a lot of tanimoto similarity calculations on large
> >> datasets using BulkTanimotoSimilarity.  It is an obvious candidate
> >> for parallelisation, so I am using concurrent.futures to do so.  If
> >> I use ProcessPoolExectuor, I get good speed-up but each process
> >> needs a copy of the fingerprint set and for the sizes I'm dealing
> >> with that uses too much memory.  With ThreadPoolExecutor I only need
> >> 1 copy of the fingerprints, but the GIL means it only runs on 1
> >> thread at a time so there's no gain.  Would it be possible to amend
> >> the C++ BulkTanimotoSimilarity to free the GIL whilst it's doing the
> >> calculation, and recapture it afterwards?  I understand things like
> >> numpy do this for some of their functions.  I'm happy to attempt it
> >> myself if someone who knows about these things can advise that it
> >> could be done, it would help, and they could provide a few pointers.
> >>
> >> Thanks,
> >> Dave
> >>
> >> --
> >>
> >> David Cosgrove
> >> Freelance computational chemistry and chemoinformatics developer
> >> http://cozchemix.co.uk
> >>
> >> ___
> >> Rdkit-discuss mailing list
> >> Rdkit-discuss@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
> >  --
> >
> > David Cosgrove
> > Freelance computational chemistry and chemoinformatics developer
> > http://cozchemix.co.uk
> >
> > --
> >
> > David Cosgrove
> > Freelance computational chemistry and chemoinformatics developer
> > http://cozchemix.co.uk
> > ___
> > Rdkit-discuss mailing list
> > Rdkit-discuss@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity

2022-10-25 Thread S Joshua Swamidass
I wonder if there is a way to make use of PyTorch or tensorflow to do this
on a GPU. That’s where some big speed ups might be found.

Also, consider using these bounds. They do make a big difference in many
cases.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2527184/


On Tue, Oct 25, 2022 at 8:57 PM Francois Berenger  wrote:

> On 24/10/2022 19:47, David Cosgrove wrote:
> > For the record, I have attempted this, but got only a marginal
> > speed-up (130% of CPU used, with any number of threads above 2).  The
> > procedure I used was to extract the fingerprint pointers into a
> > std::vector, create a std::vector for the results, unlock the GIL to
> > do the bulk tanimoto calculation, then re-lock the GIL to copy the
> > results from the std::vector into the python:list for output.  I guess
> > the extra overhead to create and populate the additional std::vectors
> > destroyed any potential speedup.  This was on a vector of 200K
> > fingerprints, which suggests that the Tanimoto calculation is a small
> > part of the overall time.  It doesn't seem worth pursuing further.
>
> There is probably code on github doing this in parallel already.
> Think about it: any clustering algorithm using a distance matrix.
> I guess many people want to initialize the Gram matrix in parallel.
>
> I wouldn't be surprised if, for example, chemfp has such code.
>
> > Dave
> >
> > On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove
> >  wrote:
> >
> >> Hi Greg,
> >> Thanks for the pointer. I’ll take a look. If it could go in the
> >> next patch release that would be really useful.
> >> Dave
> >>
> >> On Sat, 22 Oct 2022 at 10:52, Greg Landrum 
> >> wrote:
> >>
> >> Hi Dave,
> >>
> >> We have multiple examples of this in the code, here’s one:
> >>
> >>
> >
> https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40
> >>
> >> I’m not sure how this would interact with the call to
> >> Python::extract that’s in the bulk functions though
> >>
> >> It might be better to handle the multithreading on the C++ side by
> >> adding an optional nThreads argument to  the bulk similarity
> >> functions. (Though this would have to wait for the next release
> >> since it’s a feature addition… we can declare releasing the GIL
> >> as a bug fix)
> >>
> >> -greg
> >>
> >> On Sat, 22 Oct 2022 at 09:48, David Cosgrove
> >>  wrote:
> >>
> >> Hi,
> >>
> >> I'm doing a lot of tanimoto similarity calculations on large
> >> datasets using BulkTanimotoSimilarity.  It is an obvious candidate
> >> for parallelisation, so I am using concurrent.futures to do so.  If
> >> I use ProcessPoolExectuor, I get good speed-up but each process
> >> needs a copy of the fingerprint set and for the sizes I'm dealing
> >> with that uses too much memory.  With ThreadPoolExecutor I only need
> >> 1 copy of the fingerprints, but the GIL means it only runs on 1
> >> thread at a time so there's no gain.  Would it be possible to amend
> >> the C++ BulkTanimotoSimilarity to free the GIL whilst it's doing the
> >> calculation, and recapture it afterwards?  I understand things like
> >> numpy do this for some of their functions.  I'm happy to attempt it
> >> myself if someone who knows about these things can advise that it
> >> could be done, it would help, and they could provide a few pointers.
> >>
> >> Thanks,
> >> Dave
> >>
> >> --
> >>
> >> David Cosgrove
> >> Freelance computational chemistry and chemoinformatics developer
> >> http://cozchemix.co.uk
> >>
> >> ___
> >> Rdkit-discuss mailing list
> >> Rdkit-discuss@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
> >  --
> >
> > David Cosgrove
> > Freelance computational chemistry and chemoinformatics developer
> > http://cozchemix.co.uk
> >
> > --
> >
> > David Cosgrove
> > Freelance computational chemistry and chemoinformatics developer
> > http://cozchemix.co.uk
> > ___
> > Rdkit-discuss mailing list
> > Rdkit-discuss@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
-- 
Sent from Gmail Mobile
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity

2022-10-25 Thread Francois Berenger

On 24/10/2022 19:47, David Cosgrove wrote:

For the record, I have attempted this, but got only a marginal
speed-up (130% of CPU used, with any number of threads above 2).  The
procedure I used was to extract the fingerprint pointers into a
std::vector, create a std::vector for the results, unlock the GIL to
do the bulk tanimoto calculation, then re-lock the GIL to copy the
results from the std::vector into the python:list for output.  I guess
the extra overhead to create and populate the additional std::vectors
destroyed any potential speedup.  This was on a vector of 200K
fingerprints, which suggests that the Tanimoto calculation is a small
part of the overall time.  It doesn't seem worth pursuing further.


There is probably code on github doing this in parallel already.
Think about it: any clustering algorithm using a distance matrix.
I guess many people want to initialize the Gram matrix in parallel.

I wouldn't be surprised if, for example, chemfp has such code.


Dave

On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove
 wrote:


Hi Greg,
Thanks for the pointer. I’ll take a look. If it could go in the
next patch release that would be really useful.
Dave

On Sat, 22 Oct 2022 at 10:52, Greg Landrum 
wrote:

Hi Dave,

We have multiple examples of this in the code, here’s one:



https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40


I’m not sure how this would interact with the call to
Python::extract that’s in the bulk functions though

It might be better to handle the multithreading on the C++ side by
adding an optional nThreads argument to  the bulk similarity
functions. (Though this would have to wait for the next release
since it’s a feature addition… we can declare releasing the GIL
as a bug fix)

-greg

On Sat, 22 Oct 2022 at 09:48, David Cosgrove
 wrote:

Hi,

I'm doing a lot of tanimoto similarity calculations on large
datasets using BulkTanimotoSimilarity.  It is an obvious candidate
for parallelisation, so I am using concurrent.futures to do so.  If
I use ProcessPoolExectuor, I get good speed-up but each process
needs a copy of the fingerprint set and for the sizes I'm dealing
with that uses too much memory.  With ThreadPoolExecutor I only need
1 copy of the fingerprints, but the GIL means it only runs on 1
thread at a time so there's no gain.  Would it be possible to amend
the C++ BulkTanimotoSimilarity to free the GIL whilst it's doing the
calculation, and recapture it afterwards?  I understand things like
numpy do this for some of their functions.  I'm happy to attempt it
myself if someone who knows about these things can advise that it
could be done, it would help, and they could provide a few pointers.

Thanks,
Dave

--

David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

 --

David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk

--

David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss



___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity

2022-10-24 Thread David Cosgrove
For the record, I have attempted this, but got only a marginal speed-up
(130% of CPU used, with any number of threads above 2).  The procedure I
used was to extract the fingerprint pointers into a std::vector, create a
std::vector for the results, unlock the GIL to do the bulk tanimoto
calculation, then re-lock the GIL to copy the results from the std::vector
into the python:list for output.  I guess the extra overhead to create and
populate the additional std::vectors destroyed any potential speedup.  This
was on a vector of 200K fingerprints, which suggests that the Tanimoto
calculation is a small part of the overall time.  It doesn't seem worth
pursuing further.

Dave


On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove 
wrote:

> Hi Greg,
> Thanks for the pointer. I’ll take a look. If it could go in the next patch
> release that would be really useful.
> Dave
>
>
> On Sat, 22 Oct 2022 at 10:52, Greg Landrum  wrote:
>
>>
>> Hi Dave,
>>
>> We have multiple examples of this in the code, here’s one:
>>
>> https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40
>>
>> I’m not sure how this would interact with the call to Python::extract
>> that’s in the bulk functions though
>>
>> It might be better to handle the multithreading on the C++ side by adding
>> an optional nThreads argument to  the bulk similarity functions. (Though
>> this would have to wait for the next release since it’s a feature addition…
>> we can declare releasing the GIL as a bug fix)
>>
>> -greg
>>
>>
>> On Sat, 22 Oct 2022 at 09:48, David Cosgrove 
>> wrote:
>>
>>> Hi,
>>>
>>> I'm doing a lot of tanimoto similarity calculations on large datasets
>>> using BulkTanimotoSimilarity.  It is an obvious candidate for
>>> parallelisation, so I am using concurrent.futures to do so.  If I use
>>> ProcessPoolExectuor, I get good speed-up but each process needs a copy of
>>> the fingerprint set and for the sizes I'm dealing with that uses too much
>>> memory.  With ThreadPoolExecutor I only need 1 copy of the fingerprints,
>>> but the GIL means it only runs on 1 thread at a time so there's no gain.
>>> Would it be possible to amend the C++ BulkTanimotoSimilarity to free the
>>> GIL whilst it's doing the calculation, and recapture it afterwards?  I
>>> understand things like numpy do this for some of their functions.  I'm
>>> happy to attempt it myself if someone who knows about these things can
>>> advise that it could be done, it would help, and they could provide a few
>>> pointers.
>>>
>>> Thanks,
>>> Dave
>>>
>>>
>>> --
>>> David Cosgrove
>>> Freelance computational chemistry and chemoinformatics developer
>>> http://cozchemix.co.uk
>>>
>>> ___
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
>
>

-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity

2022-10-22 Thread David Cosgrove
Hi Greg,
Thanks for the pointer. I’ll take a look. If it could go in the next patch
release that would be really useful.
Dave


On Sat, 22 Oct 2022 at 10:52, Greg Landrum  wrote:

>
> Hi Dave,
>
> We have multiple examples of this in the code, here’s one:
>
> https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40
>
> I’m not sure how this would interact with the call to Python::extract
> that’s in the bulk functions though
>
> It might be better to handle the multithreading on the C++ side by adding
> an optional nThreads argument to  the bulk similarity functions. (Though
> this would have to wait for the next release since it’s a feature addition…
> we can declare releasing the GIL as a bug fix)
>
> -greg
>
>
> On Sat, 22 Oct 2022 at 09:48, David Cosgrove 
> wrote:
>
>> Hi,
>>
>> I'm doing a lot of tanimoto similarity calculations on large datasets
>> using BulkTanimotoSimilarity.  It is an obvious candidate for
>> parallelisation, so I am using concurrent.futures to do so.  If I use
>> ProcessPoolExectuor, I get good speed-up but each process needs a copy of
>> the fingerprint set and for the sizes I'm dealing with that uses too much
>> memory.  With ThreadPoolExecutor I only need 1 copy of the fingerprints,
>> but the GIL means it only runs on 1 thread at a time so there's no gain.
>> Would it be possible to amend the C++ BulkTanimotoSimilarity to free the
>> GIL whilst it's doing the calculation, and recapture it afterwards?  I
>> understand things like numpy do this for some of their functions.  I'm
>> happy to attempt it myself if someone who knows about these things can
>> advise that it could be done, it would help, and they could provide a few
>> pointers.
>>
>> Thanks,
>> Dave
>>
>>
>> --
>> David Cosgrove
>> Freelance computational chemistry and chemoinformatics developer
>> http://cozchemix.co.uk
>>
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
> --
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity

2022-10-22 Thread Greg Landrum
Hi Dave,

We have multiple examples of this in the code, here’s one:
https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40

I’m not sure how this would interact with the call to Python::extract
that’s in the bulk functions though

It might be better to handle the multithreading on the C++ side by adding
an optional nThreads argument to  the bulk similarity functions. (Though
this would have to wait for the next release since it’s a feature addition…
we can declare releasing the GIL as a bug fix)

-greg


On Sat, 22 Oct 2022 at 09:48, David Cosgrove 
wrote:

> Hi,
>
> I'm doing a lot of tanimoto similarity calculations on large datasets
> using BulkTanimotoSimilarity.  It is an obvious candidate for
> parallelisation, so I am using concurrent.futures to do so.  If I use
> ProcessPoolExectuor, I get good speed-up but each process needs a copy of
> the fingerprint set and for the sizes I'm dealing with that uses too much
> memory.  With ThreadPoolExecutor I only need 1 copy of the fingerprints,
> but the GIL means it only runs on 1 thread at a time so there's no gain.
> Would it be possible to amend the C++ BulkTanimotoSimilarity to free the
> GIL whilst it's doing the calculation, and recapture it afterwards?  I
> understand things like numpy do this for some of their functions.  I'm
> happy to attempt it myself if someone who knows about these things can
> advise that it could be done, it would help, and they could provide a few
> pointers.
>
> Thanks,
> Dave
>
>
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity

2022-10-22 Thread David Cosgrove
Hi,

I'm doing a lot of tanimoto similarity calculations on large datasets using
BulkTanimotoSimilarity.  It is an obvious candidate for parallelisation, so
I am using concurrent.futures to do so.  If I use ProcessPoolExectuor, I
get good speed-up but each process needs a copy of the fingerprint set and
for the sizes I'm dealing with that uses too much memory.  With
ThreadPoolExecutor I only need 1 copy of the fingerprints, but the GIL
means it only runs on 1 thread at a time so there's no gain.  Would it be
possible to amend the C++ BulkTanimotoSimilarity to free the GIL whilst
it's doing the calculation, and recapture it afterwards?  I understand
things like numpy do this for some of their functions.  I'm happy to
attempt it myself if someone who knows about these things can advise that
it could be done, it would help, and they could provide a few pointers.

Thanks,
Dave


-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss