Re: [Rdkit-discuss] Postgres cartridge functions

2017-01-05 Thread Greg Landrum
Hi Jaan,


On Sun, Jan 1, 2017 at 8:58 AM, jaan gruber  wrote:

> Hi.
> I have been experimenting with rdkit and PG cartridge recently. Both are
> really cool.
>

Glad to hear it!


> However I have a question related to similarity functions like this
>
> """create or replace function get_mfp2_neighbors(smiles text)
>
> returns table(molregno integer, m mol, similarity double precision) as
>   $$
>   select 
> molregno,m,tanimoto_sml(morganbv_fp(mol_from_smiles($1::cstring)),mfp2)
> as similarity
>   from rdk.fps join rdk.mols using (molregno)
>   where morganbv_fp(mol_from_smiles($1::cstring))%mfp2
>   -- order by morganbv_fp(mol_from_smiles($1::cstring))<%>mfp2;
>   $$ language sql stable ;
> """
>
> Is it possible to modify the function so it doesn't recalculate
> morganbv_fp(mol_from_smiles($1::cstring))?
>

Sure, but I believe that PostgreSQL is smart enough that it doesn't
actually do any recalculation.

Here's a version of the function that only has the explicit fingerprint
calculation once:

create or replace function get_mfp2_neighbors2(smiles text)
returns table(molregno integer, m mol, similarity double precision) as
  $$
  select molregno,m,tanimoto_sml(qfp,mfp2) as similarity
  from (select molregno,mfp2,morganbv_fp(mol_from_smiles($1::cstring)) qfp
        from rdk.fps) fps
  join rdk.mols using (molregno)
  where qfp%mfp2
  order by qfp<%>mfp2;
  $$ language sql stable ;


I think I like the readability of this a bit better than the original
form, but the performance is the same:

chembl_21=# select get_mfp2_neighbors('Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1');
 get_mfp2_neighbors

 (2,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1",1)
 (6,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1c1",0.775510204081633)
 (5,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(Cl)cc1",0.76)
(3 rows)

Time: 573.769 ms
chembl_21=# select get_mfp2_neighbors2('Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1');
get_mfp2_neighbors2

 (2,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1",1)
 (6,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1c1",0.775510204081633)
 (5,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(Cl)cc1",0.76)
(3 rows)

Time: 571.064 ms


> Or would it be possible that the function accept mol or even better -
> fingerprint?
>

Also possible:

create or replace function get_mfp2_neighbors3(qfp bfp)
  returns table(molregno integer, m mol, similarity double precision) as
$$
select molregno,m,tanimoto_sml(qfp,mfp2) as similarity
from rdk.fps
join rdk.mols using (molregno)
where qfp%mfp2
order by qfp<%>mfp2;
$$ language sql stable ;


But, again, it doesn't make a difference in terms of performance:

chembl_21=# select get_mfp2_neighbors3(morganbv_fp('Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1'));
get_mfp2_neighbors3

 (2,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1",1)
 (6,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1c1",0.775510204081633)
 (5,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(Cl)cc1",0.76)
(3 rows)

Time: 564.451 ms

> Or maybe the ID of your "query fingerprint" in another table? I plan to run
> a large number of comparisons (tens of thousands) with several fingerprints,
> so I am looking for ways to make it as fast as possible.
>

PostgreSQL tends to be pretty smart about not doing unnecessary work, so I
believe that the original form is optimal in terms of performance, but you
can choose one of the others if you find it more readable/usable.

Hopefully there's enough info here to allow you to do some experiments of
your own.
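
As a starting point for those experiments, here is a minimal Python sketch that times one neighbor search per query SMILES. It assumes the get_mfp2_neighbors function from above is already installed in the database and that you pass in a DB-API cursor whose driver uses "%s" placeholders (psycopg2 does; other drivers may differ). The helper name time_neighbor_queries and the NEIGHBOR_SQL constant are made up for this example.

```python
import time

# Assumption: get_mfp2_neighbors(text) exists in the database as defined
# earlier in this thread. "%s" is the placeholder style used by psycopg2.
NEIGHBOR_SQL = "select * from get_mfp2_neighbors(%s)"


def time_neighbor_queries(cursor, smiles_list, sql=NEIGHBOR_SQL):
    """Run one neighbor search per SMILES.

    Returns a dict mapping each SMILES to (rows, elapsed_seconds).
    `cursor` is any DB-API cursor; this helper is hypothetical and not
    part of RDKit or the cartridge.
    """
    results = {}
    for smiles in smiles_list:
        start = time.perf_counter()
        cursor.execute(sql, (smiles,))  # the driver handles the quoting
        rows = cursor.fetchall()
        results[smiles] = (rows, time.perf_counter() - start)
    return results
```

Swapping the sql argument for a call to get_mfp2_neighbors2 would let you compare the variants on your own data instead of relying on my single timing above.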

-greg
--
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Fwd: conda / Windows update to 2016.09 release gives error

2017-01-05 Thread Curt Fischer
This worked for me.  Thanks Greg.  CF

On Tue, Jan 3, 2017 at 8:03 PM, Greg Landrum  wrote:

> Curt,
>
> If you change lines 32 and 33 in /lib/site-packages\rdkit\RDConfig.py
> to:
>   condaDir += ['Library', 'share', 'RDKit']
>   _share = os.path.join(*condaDir)
>
> I think it should work.
>
> Sorry for the inconvenience here; we will fix it before running the next
> conda builds.
>
> -greg
>
>
> On Wed, Jan 4, 2017 at 1:44 AM, Curt Fischer 
> wrote:
>
>>
>> Thanks for writing in Matt!
>>
>> Do you or any other readers think there is any chance that a small manual
>> fix to RDConfig.py could fix the problem?  I have very little experience
>> with building anything from source and would like to use the newest version
>> of rdkit if possible.  Would it be as simple as adding the *.sep* to
>> */lib/site-packages\rdkit\RDConfig.py* ?
>> Curt
>>
>> On Wed, Dec 21, 2016 at 2:22 AM, Matthew Swain  wrote:
>>
>>> I've also encountered this problem with the 2016.09.2 windows packages
>>> on the rdkit conda channel. It looks like somehow the RDConfig patch in the
>>> conda recipe hasn't been applied properly in the published packages.
>>>
>>> The original lines in the rdkit are:
>>>
>>> condaDir += ['share', 'RDKit']
>>> _share = os.path.join(*condaDir)
>>>
>>> The conda recipe has a Windows-specific patch to change this to:
>>>
>>> condaDir += ['Library','share','RDKit']
>>> _share = os.path.sep.join(condaDir)
>>>
>>> Which looks fine (although the second line doesn't really need
>>> changing?). But in the published packages it is:
>>>
>>> condaDir += ['share', 'RDKit', 'RDKit']
>>> _share = os.path.join(condaDir)
>>>
>>> This causes the AttributeError because it incorrectly passes a list to
>>> os.path.join, with no asterisk for unpacking the list into *args. The first
>>> line is also incorrect.
>>>
>>> I built the package myself from the recipe, and didn't see this issue.
>>>
>>> Matt
>>>
>>> On Dec 09, 2016, at 05:05 PM, Curt Fischer 
>>> wrote:
>>>
>>> I'm not sure of the source of the problem with the conda 2016.09 release
>>> on my Windows box, but I was able to revert to a 2016.03 release with a
>>> *conda install -c rmg rdkit=2016.03**
>>>
>>> conda couldn't seem to solve the specifications automagically, but after
>>> I uninstalled boost and did the above command, it identified the proper
>>> boost to install along with the 2016.03 rdkit.
>>>
>>> I now have a functioning rdkit again, but would still be interested in
>>> hearing from anyone that experiences a similar problem.
>>>
>>> On Thu, Dec 8, 2016 at 9:27 AM, Curt Fischer 
>>> wrote:
>>>
 To update rdkit to the September release, I recently did a

 *conda install -f --channel https://conda.anaconda.org/rdkit rdkit*

 on my Windows box, and everything seemed to update fine.

 However now, when I try from rdkit import Chem, I get the disturbing
 error message below.

 Is this a sign that my particular installation got borked somehow, and
 maybe I should reinstall everything again?  Or is this perchance a known
 issue with the 2016.09 release?  If the latter, how do I roll back to the
 old release using conda?  I tried a
 *conda install --channel https://conda.anaconda.org/rdkit rdkit=2016.03.4*
 but that didn't seem to do it.

 Thanks all for any help!

 Curt

 ---------------------------------------------------------------------------
 AttributeError                            Traceback (most recent call last)
  in ()
 ----> 1 from rdkit import Chem

 C:\Anaconda2\lib\site-packages\rdkit\Chem\__init__.py in ()
      17 """
      18 from rdkit import rdBase
 ---> 19 from rdkit import RDConfig
      20
      21 from rdkit import DataStructs

 C:\Anaconda2\lib\site-packages\rdkit\RDConfig.py in ()
      31     condaDir[0] = os.path.sep
      32   condaDir += ['share', 'RDKit', 'RDKit']
 ---> 33   _share = os.path.join(condaDir)
      34   RDDataDir = os.path.join(_share, 'Data')
      35   RDDocsDir = os.path.join(_share, 'Docs')

 C:\Anaconda2\lib\ntpath.pyc in join(path, *paths)
      63 def join(path, *paths):
      64     """Join two or more pathname components, inserting "\\" as needed."""
 ---> 65     result_drive, result_path = splitdrive(path)
      66     for p in paths:
      67         p_drive, p_path = splitdrive(p)

 C:\Anaconda2\lib\ntpath.pyc in splitdrive(p)
     114     """
     115     if len(p) > 1:
 --> 116         normp = p.replace(altsep, sep)
     117         if (normp[0:2] == sep*2) and (normp[2:3] != sep):
     118             # is a UNC path:

 AttributeError: 'list' object has no attribute 'replace'


>>>
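
For anyone who wants to reproduce the os.path.join behaviour discussed in this thread without a Windows box, here is a minimal sketch. It uses the standard-library ntpath module (the Windows flavour of os.path), so it runs on any platform. The parts list is a hypothetical stand-in for condaDir, assuming its first element is a bare drive such as 'C:'. Note that under Python 3 the failing call raises TypeError, not the AttributeError that the Python 2.7 traceback above shows.

```python
import ntpath  # Windows implementation of os.path, importable everywhere

# Hypothetical stand-in for condaDir after the Windows-specific patch.
parts = ['C:', 'Anaconda2', 'Library', 'share', 'RDKit']


def join_without_unpacking(components):
    """Pass the list itself, as the published package did."""
    try:
        return ntpath.join(components)
    except (TypeError, AttributeError) as exc:
        # TypeError on Python 3, AttributeError on Python 2.7
        return type(exc).__name__


# Unpacking the list works, but a bare drive followed by a relative
# component yields a drive-relative path (no backslash after 'C:').
unpacked = ntpath.join(*parts)

# Joining with the separator gives an absolute path in this case.
sep_joined = ntpath.sep.join(parts)

print(join_without_unpacking(parts))
print(unpacked)
print(sep_joined)
```

If condaDir really does start with a bare drive, the difference between the last two results may be one reason the conda recipe uses os.path.sep.join, though that is a guess on my part and not something confirmed in the thread.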