Re: [Rdkit-discuss] Partial substructure match?

2020-11-20 Thread Rajarshi Guha
One approach could be to assign scoring functions for bond and atom matches
(such as what OE supports
<https://docs.eyesopen.com/toolkits/python/oechemtk/patternmatch.html#mcs-scoring-functions>
)
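
A minimal sketch of what such a score could look like, assuming the largest shared fragment has already been found by an MCS search (in RDKit, the result of rdFMCS.FindMCS exposes numAtoms and numBonds). The CommonSubstructure container, the function name and the weights are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonSubstructure:
    """Size of the largest fragment shared by query and target (hypothetical container)."""
    num_atoms: int
    num_bonds: int


def partial_match_score(common, query_atoms, query_bonds,
                        atom_weight=1.0, bond_weight=0.0):
    """Weighted fraction of the query covered by the common substructure (0.0 to 1.0)."""
    total = atom_weight * query_atoms + bond_weight * query_bonds
    if total <= 0:
        raise ValueError("query must contribute a positive weighted size")
    covered = atom_weight * common.num_atoms + bond_weight * common.num_bonds
    return covered / total


# Naphthalene query (10 heavy atoms, 11 bonds) against a target that only
# shares a benzene ring with it (6 atoms, 6 bonds in common).
shared = CommonSubstructure(num_atoms=6, num_bonds=6)
atom_only = partial_match_score(shared, query_atoms=10, query_bonds=11)
atoms_and_bonds = partial_match_score(shared, query_atoms=10, query_bonds=11,
                                      atom_weight=1.0, bond_weight=1.0)
print(f"atoms only: {atom_only:.0%}, atoms and bonds: {atoms_and_bonds:.0%}")
```

Counting atoms only reproduces the 60% figure from the original question; giving bonds a weight lowers the score, because the bond that fuses the two rings is not matched.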

On Fri, Nov 20, 2020 at 9:58 AM Gustavo Seabra 
wrote:

> Hi Adelene,
>
> Doesn't the substructure match only work for the whole substructure, as
> an all-or-nothing?
>
> I suppose I could use the MCSS and count the number of matching atoms,
> then calculate the percentage match myself.
>
> Is it possible to get a partial match with substructure search?
>
> Gustavo.
>
> --
> Gustavo Seabra
>
> --
> *From:* Adelene LAI 
> *Sent:* Friday, November 20, 2020 9:13:15 AM
> *To:* Dan Nealschneider ; Gustavo
> Seabra 
> *Cc:* RDKit Discuss 
> *Subject:* Re: [Rdkit-discuss] Partial substructure match?
>
>
> Hi Dan and Gustavo,
>
>
> MCSS sounds good, but depends on the goal.
>
>
> From the way Gustavo wrote, it sounds like a Query-Target substructure
> search - he has a list of targets and one specific query, and he wants to
> compare matching rate amongst the members of the list.
>
>
> If so, I would try query SMARTS.
>
>
> https://www.rdkit.org/docs/GettingStartedInPython.html#substructure-searching
>
>
> Regarding the % substructure match, interesting question. How would you
> quantify that? Not sure such a thing exists in RDKit right now.
>
>
> Adelene
>
>
> Doctoral Researcher
>
> Environmental Cheminformatics
>
> UNIVERSITÉ DU LUXEMBOURG
>
>
> Campus Belval | Luxembourg Centre for Systems Biomedicine
>
> 6, avenue du Swing, L-4367 Belvaux
>
> T +356 46 66 44 67 18
>
> --
> *From:* Dan Nealschneider 
> *Sent:* Thursday, November 19, 2020 6:01:37 PM
> *To:* Gustavo Seabra
> *Cc:* RDKit Discuss
> *Subject:* Re: [Rdkit-discuss] Partial substructure match?
>
> Gustavo -
> That sounds like the "maximum common substructure" problem. Here's the
> relevant section in RDKit's  "Getting started in Python"
>
>
> https://www.rdkit.org/docs/GettingStartedInPython.html#maximum-common-substructure
>
>
> *dan nealschneider* | lead developer
>
>
> On Thu, Nov 19, 2020 at 8:50 AM Gustavo Seabra 
> wrote:
>
> Hi all,
>
> Is it possible to search for *partial* substructure matches using RDKit?
>
> I'm aware of "HasSubstructMatch/ GetSubstructMatch", but my impression is
> that it only returns full matches (100%) of the required pattern in a
> structure.
>
> However, what I'd like to do is a bit different: Imagine I have one
> specific
> substructure (scaffold), and I'd like to search for molecules that have the
> full substructure *or part of it*, and maybe get the percentage of the
> substructure match? (100% = the full substructure is contained in the
> molecule). For example, if the pattern is a naphthalene and the molecule to
> search has a benzene, that would count as a 60% match.
>
> Is there a way to do that in RDKit?
>
> Thanks a lot!
> --
> Gustavo Seabra
>
>
>
>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>


-- 
Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>


Re: [Rdkit-discuss] RDKit ElasticSearch Plugin

2021-01-21 Thread Rajarshi Guha
There was a presentation at a recent Cambridge Cheminformatics meeting on
using Elastic for similarity searches, and I think the presenter was also
considering extending to substructure matching as well. But no code
available as far as I can tell

https://github.com/MysterionRise
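
The two-stage search discussed in the quoted reply below (a fast fingerprint screen followed by exact subgraph matching) can be sketched as follows. This is an illustration only: fingerprints are modelled as plain Python integers used as bit sets, and the verify callback stands in for a real substructure matcher. All names are hypothetical.

```python
def screen(query_fp, target_fps):
    """Return ids whose fingerprint contains every bit that is set in the query.

    This step can only rule molecules out; survivors are candidates, not hits.
    """
    return [tid for tid, fp in target_fps.items() if fp & query_fp == query_fp]


def substructure_search(query_fp, target_fps, verify):
    """Screen first, then run the expensive exact check on the survivors only."""
    return [tid for tid in screen(query_fp, target_fps) if verify(tid)]


# Toy data: each integer is a 4-bit pattern fingerprint.
targets = {"mol_a": 0b1110, "mol_b": 0b0110, "mol_c": 0b1011}
query = 0b0110

candidates = screen(query, targets)
# Stand-in for the exact matcher: pretend mol_a is a fingerprint false positive.
hits = substructure_search(query, targets, verify=lambda tid: tid != "mol_a")
print(candidates, hits)
```

A search engine that implements only the first function has to ship every candidate to the client for verification, which is why hit counts and paging need to be corrected afterwards.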

On Thu, Jan 21, 2021 at 10:44 AM Joos Kiener  wrote:

> Hi Naomi,
>
> I once played around a bit with this idea using the Lucene-based RDKit
> example as guidance. However what that code does inside Lucene and hence my
> "adaption" inside elastic search is only the fingerprint screening part.
> For the actual subgraph-match the data then has to be sent to the
> caller/client and doesn't run inside elastic search and means one must
> manipulate the elastic search results (hit count, paging,...) before
> finally returning to the end user application. Simply said, not a very
> usable but very hacky solution.
>
> Even ignoring that part, it wasn't very fast either. That could be due to
> many things like only having 1 machine for ES (my machine, no cluster) and
> not being an expert in ES anyway (suboptimal config?). Or maybe the dataset
> was too small to actually benefit. The same data and query run much faster in
> PostgreSQL + RDKit + full-text index, which is also easier to use. (Yes,
> PostgreSQL supports full-text search similar to Elastic. If one doesn't need
> very advanced features or have a lot of data, it is certainly worth a look.)
>
> Any "real solution" must also do the subgraph matching inside elastic
> itself which means writing a plugin / extension for elasticsearch. This was
> simply too involved for me to even try. (If that is of interest, you should
> probably also look at the very recent licensing changes to elasticsearch).
>
> The presentation Joshua mentioned is actually only about similarity search
> which naturally is easier to implement and fast.
>
> Having said that, there is a commercial solution available from
> PerkinElmer in their Signals Data factory offering. Of course this has
> nothing to do with RDKit but it does hint that it's possible to do this if
> you have the time, budget and skills/knowledge.
>
> Another commercial "fast substructure search" option would be NextMove's
> Arthor, but that has nothing to do with elasticsearch. The question is whether you
> want elasticsearch due to the speed or due to the combination with text
> search. I would probably avoid it if the text search part is not important.
>
> Just using RDKit default functionality is actually pretty fast (see
> Greg's blog), although it does run in memory. Nowadays a machine with lots of
> RAM doesn't cost all that much so I could see that scaling to 10-20 million
> structures easily.
>
> hope that helps you a bit to come to a conclusion on what to do.
>
> Best Regards,
>
> Joos
>
>
> -- Forwarded message --
>> From: Naomi Jacobs 
>> To: rdkit-discuss@lists.sourceforge.net
>> Cc: Alan Pierce , Larry Taylor 
>> Bcc:
>> Date: Wed, 20 Jan 2021 22:27:32 -0800
>> Subject: [Rdkit-discuss] RDKit ElasticSearch Plugin
>> Hi all,
>>
>> We're looking for information about whether anyone has built an
>> ElasticSearch plugin using RDKit to support chemical search. I didn't see
>> anything open-source online, but was thinking some folks may have heard
>> about internal efforts and would be willing to share any code and/or chat
>> about it. Thanks!
>>
>> Cheers,
>> Naomi
>>
>> --
>> *Naomi Jacobs*
>> Software Engineer | benchling.com
>> (415) 590-2798
>>
>>
>>
>> -- Forwarded message --
>> From: Greg Landrum 
>> To: Naomi Jacobs 
>> Cc: RDKit Discuss , Larry Taylor <
>> la...@benchling.com>
>> Bcc:
>> Date: Thu, 21 Jan 2021 08:54:08 +0100
>> Subject: Re: [Rdkit-discuss] RDKit ElasticSearch Plugin
>> Hi Naomi,
>>
>> I'm not personally aware of any ElasticSearch work, but there is a
>> prototype for a lucene plugin which could, I believe, be used as the basis
>> for an ElasticSearch plugin:
>> https://github.com/rdkit/org.rdkit.lucene
>>
>> It's (obviously) been a while since anyone did anything with that code
>> and it may no longer work, but the more recent (and still functional)
>> RDKit-neo4j integration (https://github.com/rdkit/neo4j-rdkit) can
>> provide some patterns for how the RDKit java integration can be used in
>> this type of context.
>>
>> I hope this helps, and would be interested to hear if you end up doing
>> anything with the RDKit and ElasticSearch.
>> -greg
>>
>>


-- 
Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>


Re: [Rdkit-discuss] how to make a database fingerprint

2021-09-15 Thread Rajarshi Guha
Is it correct to use Morgan fingerprints for this type of analysis, given
that individual bit positions don't correspond to specific
substructures/features? The original work used key fp's (MACCS and Pubchem)
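
For the dictionary-to-fingerprint step asked about in the quoted message below, here is a minimal sketch in plain Python. It assumes each molecule's fingerprint is available as a collection of on-bit ids (for example the keys returned by GetNonzeroElements()) and that the database fingerprint keeps the bits present in at least a chosen fraction of the molecules. The function names and the min_fraction cut-off are hypothetical, not taken from the article.

```python
from collections import Counter


def bit_frequencies(fingerprints):
    """Count in how many molecules each bit id is set."""
    counts = Counter()
    for on_bits in fingerprints:
        counts.update(set(on_bits))  # set(): count each bit once per molecule
    return counts


def database_fingerprint(counts, n_molecules, min_fraction=0.5):
    """Keep the bits that are set in at least min_fraction of the molecules."""
    return frozenset(bit for bit, n in counts.items()
                     if n / n_molecules >= min_fraction)


def tanimoto(a, b):
    """Tanimoto similarity between two collections of on-bit ids."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Toy on-bit sets standing in for three molecules.
fps = [{1, 5, 9}, {1, 5, 7}, {1, 2, 9}]
counts = bit_frequencies(fps)
dfp = database_fingerprint(counts, len(fps), min_fraction=0.6)
similarities = [tanimoto(dfp, fp) for fp in fps]
print(sorted(dfp), similarities)
```

Working with sets of bit ids avoids choosing a fixed vector length, so the same code applies to key-based fingerprints such as MACCS, where each id has a fixed meaning.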

On Wed, Sep 15, 2021 at 11:25 AM Patrick Walters 
wrote:

> numpy!
>
> import pandas as pd
> from descriptor_gen import DescriptorGen
> import numpy as np
> from rdkit import Chem, DataStructs
> from rdkit.Chem import AllChem
>
> def smi2fp(smi):
>     mol = Chem.MolFromSmiles(smi)
>     fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
>     arr = np.zeros((0,), dtype=np.int8)
>     DataStructs.ConvertToNumpyArray(fp,arr)
>     return arr
>
> df = pd.read_csv("chembl_drugs.smi",sep=" ",names=["SMILES","Name"])
> df['fp'] = df.SMILES.apply(smi2fp)
> db_fp = np.stack(df.fp).sum(axis=0)
>
> On Wed, Sep 15, 2021 at 9:32 AM Giovanni Tricarico <
> giovanni.tricar...@glpg.com> wrote:
>
>> Hello,
>>
>> based on this article:
>>
>>
>>
>> https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0195-1
>>
>>
>>
>> I have been trying to make what they call a ‘database fingerprint’.
>>
>>
>>
>> The first step seems to require obtaining the frequencies of each
>> fingerprint bit in a database of molecules.
>>
>> To do that, I calculated the fingerprints of a list of molecules (much
>> larger than the one below; this is just an example):
>>
>>
>>
>> ms = [Chem.MolFromSmiles(s) for s in ['c1c1','CCC','CCCO']]
>>
>> fps = [rdMolDescriptors.GetMorganFingerprint(m, 3, useCounts = False) for
>> m in ms]
>>
>>
>>
>> My first attempt to obtain the database fingerprint was by looping through
>> the fps and summing (+=), as that is reported to be an allowed operation
>> for these fingerprints.
>>
>> This worked, but was very slow.
>>
>>
>>
>> My next attempt was to convert each fingerprint to a dictionary, and
>> build the dictionary corresponding to the database fingerprint:
>>
>>
>>
>> database_fp_new = dict()
>>
>> for i,fp in enumerate(fps):
>>     for fpbit in fp.GetNonzeroElements():
>>         if fpbit in database_fp_new:
>>             database_fp_new[fpbit] += 1
>>         else:
>>             database_fp_new[fpbit] = 1
>>
>>
>>
>> This worked, too, gave the same result as the ‘+=’ approach, and was much
>> faster.
>>
>>
>>
>> {98513984: 1,
>>  2763854213: 1,
>>  3218693969: 1,
>>  3741631696: 1,
>>  2068133184: 1,
>>  2245384272: 2,
>>  2246728737: 2,
>>  3542456614: 2,
>>  864662311: 1,
>>  1173125914: 1,
>>  1365892349: 1,
>>  1535166686: 1,
>>  4023654873: 1}
>>
>>
>>
>> However, then I have a dictionary.
>>
>> But I need a fingerprint, because I want to do operations like similarity
>> calculations (e.g.
>> https://www.rdkit.org/docs/source/rdkit.DataStructs.cDataStructs.html?highlight=bulktanimoto#rdkit.DataStructs.cDataStructs.BulkTanimotoSimilarity
>> ).
>>
>>
>>
>> Would anyone be able to suggest if and how the dictionary can be turned back
>> into a fingerprint, or perhaps advise how to make the database fingerprint
>> in a different way, if the one I figured out is not optimal?
>>
>>
>>
>> Thank you
>>


-- 
Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>


Re: [Rdkit-discuss] MFP question about similar substructures and feature reduction

2021-09-29 Thread Rajarshi Guha
I'd be wary of using PCA on binary fingerprints based on Martin and Cao
(2015 <https://dx.doi.org/10.1007/s10822-014-9819-y>)
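
As an alternative to PCA, the low-variance filtering asked about in the original question can be sketched as below. The key point is that the list of kept bit indices is learned once on the training set and then reused unchanged for every new molecule. This is a plain-Python illustration with hypothetical function names and threshold; scikit-learn's VarianceThreshold follows the same fit-then-transform pattern.

```python
def select_informative_bits(train_rows, min_variance=0.01):
    """Return the column indices whose bit variance over the training set is large enough.

    For a binary column that is on in a fraction p of the rows, the variance is p * (1 - p).
    """
    n_rows = len(train_rows)
    kept = []
    for j in range(len(train_rows[0])):
        p = sum(row[j] for row in train_rows) / n_rows
        if p * (1 - p) >= min_variance:
            kept.append(j)
    return kept


def reduce_bits(row, kept):
    """Apply the column selection learned on the training set to one fingerprint."""
    return [row[j] for j in kept]


# Toy 4-bit fingerprints: column 0 is always on and column 3 is always off.
train = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]
kept = select_informative_bits(train)
new_ligand = [0, 1, 1, 1]
reduced = reduce_bits(new_ligand, kept)
print(kept, reduced)
```

Because column j always refers to bit j of the fingerprint, stacking fingerprints row by row keeps the features aligned across molecules, up to the bit collisions mentioned in the quoted reply.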

On Wed, Sep 29, 2021 at 3:34 PM Rafael L via Rdkit-discuss <
rdkit-discuss@lists.sourceforge.net> wrote:

> Hello, your question prompted me to write a small notebook, which I hope
> you may find useful:
>
> https://github.com/rflameiro/projects/blob/main/comparing_fingerprint_bits.ipynb
>
> In summary, bits that are active in both fingerprints usually correspond
> to the same substructure, unless bit collision happens. You can verify that
> by drawing the substructure that activates a certain bit using the
> function Draw.DrawMorganBit().
>
> -- What happens if the 2048 bits or substructures predesignated in rdkit
> do not contain a new substructure in a molecule we are evaluating?
> If I understand correctly, you want to know what a fingerprint will look
> like for a molecule that doesn't have new substructures compared to a
> previously calculated fingerprint. In this case, the new fingerprint will
> be the same (although this is more common when working with MACCS
> fingerprints, which work with a predetermined set of substructures), or the
> new molecule will have fewer substructures than the previous one, and fewer
> bits will be active.
>
> -- Any advice on how to reduce features and then use that reduced feature
> list for new molecules after training a model would also be appreciated.
> How would the model only extract the reduced bits for a new ligand if I
> remove low variance bits from the training set for example?
> To build models on fingerprints, you can start using the complete set of
> 2048 bits, and compare the performance with fingerprints containing fewer
> bits (1024, 512...). A good starting point is:
>
> https://www.moreisdifferent.com/2017/9/21/DIY-Drug-Discovery-using-molecular-fingerprints-and-machine-learning-for-solubility-prediction/
> You should see a drop in performance as the bit size decreases, as bit
> collisions are more likely.
> Alternatively, you could try reducing the dimensionality by using a
> technique such as PCA, but use enough PCs to get a reasonable explained
> variance percentage. It is easy to calculate PCs with scikit-learn. Then,
> to apply it in new fingerprints, you will only have to call .transform().
> See:
> https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/
>
> On Mon, Sep 27, 2021 at 8:35 PM Natasha Gupta
> wrote:
>
>> Hello,
>>
>> Apologies, this is a very basic question:
>> If I am converting many ligands into morgan fingerprints, could I
>> theoretically stack the bit representations on top of each other to get the
>> same features represented across ligands? For example is the below
>> representation correct?
>>
>> | sample | feature1 | feature2 | feature3 |
>> |:-------|:--------:|:--------:|---------:|
>> | 1      | bit 1    | bit 2    | bit 3    |
>> | 2      | bit 1    | bit 2    | bit 3    |
>> | 3      | bit 1    | bit 2    | bit 3    |
>>
>> So basically is feature 1, 2, 3 etc always one type of substructure no
>> matter what the input molecule is? What happens if the 2048 bits or
>> substructures predesignated in rdkit do not contain a new substructure in a
>> molecule we are evaluating?
>>
>> Any advice on how to reduce features and then use that reduced feature
>> list for new molecules after training a model would also be appreciated.
>> How would the model only extract the reduced bits for a new ligand if I
>> remove low variance bits from the training set for example?
>>
>
>
> --
> Rafael da Fonseca Lameiro
> https://orcid.org/-0003-4466-2682
> PhD Student - Grupo de Química Medicinal e Biológica (NEQUIMED)
> Instituto de Química de São Carlos - Universidade de São Paulo - Brazil
> Av. Trabalhador Sancarlense, 400 - CEP: 13566-590 - São Carlos/SP


-- 
Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>


Re: [Rdkit-discuss] Clustering

2022-05-01 Thread Rajarshi Guha
You could consider using FAISS. An example of clustering 2.1M cmpds is
described at
http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
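
Separately from the FAISS route, here is a minimal sketch of a single-pass "leader" (sphere-exclusion style) clustering. It addresses the two limitations raised in the quoted question: it needs neither a preset number of clusters nor a full distance matrix, and every cluster gets a real molecule as its representative. This is an illustration under simplifying assumptions, not a tuned implementation. Fingerprints are modelled as sets of on-bit ids and all names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def leader_cluster(fingerprints, threshold=0.5):
    """Assign each fingerprint to the first leader it resembles, else make it a leader.

    Returns (leaders, assignment): leaders holds the index of each cluster's
    representative, assignment[i] is the cluster number of fingerprint i.
    The result depends on input order, so shuffle or sort deliberately.
    """
    leaders = []
    assignment = []
    for i, fp in enumerate(fingerprints):
        for cluster, leader_idx in enumerate(leaders):
            if tanimoto(fp, fingerprints[leader_idx]) >= threshold:
                assignment.append(cluster)
                break
        else:  # no existing leader is similar enough
            leaders.append(i)
            assignment.append(len(leaders) - 1)
    return leaders, assignment


fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {1, 2, 3, 4, 5}]
leaders, assignment = leader_cluster(fps, threshold=0.5)
print(leaders, assignment)
```

The cost is one similarity calculation per molecule per existing leader, so the similarity threshold controls both the cluster count and the run time.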


On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
tristan.camilleri...@um.edu.mt> wrote:

> Hi,
>
> I am attempting to cluster a database of circa 4M small molecules and I
> have hit several snags.
> Using BulkTanimoto is not possible due to the resources that are required. I
> am now working with fpsim2 and chemfp to get a distance matrix (sparse
> matrix). However, I am finding it very challenging to identify an
> appropriate clustering algorithm. I have considered both k-medoids and
> DBSCAN. Each of these has its own limitations, stating the number of
> clusters for k-medoids and not obtaining centroids for DBSCAN.
>
> I was wondering whether there is an implementation of the stochastic
> clustering analysis for clustering purposes, described in
> https://doi.org/10.1021/ci970056l .
>
> Any suggestions on the best method for clustering large datasets, with
> code suggestions, would be greatly appreciated. I am new to the subject and
> would appreciate any help.
>
> Regards,
> Tristan
>
>


-- 
Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>


Re: [Rdkit-discuss] comparing two or more tables of molecules

2016-11-28 Thread Rajarshi Guha
It really boils down to how you standardize molecules such that you end up
with a canonical structure.

SMILES is not the issue here - if your standardizer does a proper job with
aromaticity, tautomers etc then you can get a canonical SMILES.

You can use the InChI model as well to generate a canonical SMILES (
https://jcheminf.springeropen.com/articles/10.1186/1758-2946-4-22).

This doesn't really answer your question, as I'm not familiar with RDKit
functionality for standardization.

(As an aside, internally we use https://github.com/ncats/lychi which is
conceptually similar to InChI)

PS. I don't think this is a job for fingerprint based similarity methods
though
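
For the table-collation part of the question, here is a minimal sketch of a stereo-insensitive join. It assumes every record has already been standardized and converted to a standard InChIKey by whatever toolkit is in use. The first 14-character block of an InChIKey hashes the formula and connectivity, while the second block covers stereochemistry and isotopes, so grouping on the first block ignores stereoisomerism. The table layout and function names are hypothetical.

```python
from collections import defaultdict


def skeleton_key(inchikey):
    """First block of a standard InChIKey (formula and connectivity only)."""
    return inchikey.split("-")[0]


def group_by_skeleton(tables):
    """Group (source, name) records from several tables by connectivity."""
    groups = defaultdict(list)
    for source, rows in tables.items():
        for name, inchikey in rows:
            groups[skeleton_key(inchikey)].append((source, name))
    return dict(groups)


tables = {
    "vendor_a": [
        ("L-alanine", "QNAYBMKLOCPYGJ-REOHSZHUSA-N"),
        ("glycine", "DHMQDGOQFOQNFH-UHFFFAOYSA-N"),
    ],
    "vendor_b": [
        ("D-alanine", "QNAYBMKLOCPYGJ-UWTATZPHSA-N"),
    ],
}
merged = group_by_skeleton(tables)
for key, records in merged.items():
    print(key, records)
```

Comparing the full key instead of the first block keeps the stereoisomers apart. Either way the result is only as good as the standardization applied before the keys were generated.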

On Mon, Nov 28, 2016 at 11:25 AM, Stephen O'hagan 
wrote:

> Has anyone come up with fool-proof way of matching structurally equivalent
> molecules?
>
>
>
> Unique Smiles or InChI String comparisons don’t appear to work, presumably
> because there are different but equivalent structures, e.g. explicit vs
> non-explicit H’s, Kekule vs Aromatic, isomeric forms vs non-isomeric form,
> tautomers etc.
>
>
>
> I also expect that comparing InChI strings might need something more than
> just a simple string comparison, such as masking off stereo information
> when you don’t care about stereo isomers.
>
>
>
> I assume there are suitable tools within RDKit that can do this?
>
>
>
> N.B. I need to collate tables from several sources that have a mix of
> smiles / InChI / sdf molecular representations.
>
>
>
> I usually use RDKit via Python and/or Knime.
>
>
>
> Cheers,
>
> Steve.
>
>
>
> 


-- 
Rajarshi Guha | http://blog.rguha.net
NIH Center for Advancing Translational Science


[Rdkit-discuss] Postgres query performance

2017-11-30 Thread Rajarshi Guha
Hi, I'm working on an application that is trying to do arbitrary
substructure queries across ChEMBL 23. I pretty much followed the
instructions at http://www.rdkit.org/docs/Cartridge.html on a Postgres 9.2
instance with RDKit 2016.03.1 all running on a Linux box with 88GB of RAM.

But when running

select count(*) from rdk.mols where m@>'c1cncc2n1ccn2' ;

the query gives back 1775 rows as noted in the page, but takes 2016.182 ms
compared to the 88ms reported on the page.

I realize there are a lot of factors unrelated to RDKit that affect query
performance, but does anybody have suggestions to boost substructure query
performance?

Thanks,

-- 
Rajarshi Guha | http://blog.rguha.net
NIH Center for Advancing Translational Science


Re: [Rdkit-discuss] Postgres query performance

2017-12-08 Thread Rajarshi Guha
Thanks Riccardo.

While I can see Rdkit 2017.09.2.0 is available on conda (
https://anaconda.org/search?q=rdkit), rdkit-postgres seems to be stuck
at 2016.03.4

Can I update the rdkit installation and use it to update the Postgres
extension? Or must I wait for a rdkit-postgres conda with the right version
of Rdkit?

On Thu, Nov 30, 2017 at 6:03 PM, Riccardo Vianello <
riccardo.viane...@gmail.com> wrote:

> Hi Rajarshi,
>
> On Thu, Nov 30, 2017 at 10:31 PM, Rajarshi Guha 
> wrote:
>
>> Hi, I'm working on an application that is trying to do arbitrary
>> substructure queries across ChEMBL 23. I pretty much followed the
>> instructions at http://www.rdkit.org/docs/Cartridge.html on a Postgres
>> 9.2 instance with RDKit 2016.03.1 all running on a Linux box with 88GB of
>> RAM.
>>
>
> some 2016 releases were affected by a bug in the cartridge Makefile that
> disabled any compiler optimization. If possible, I would suggest trying a
> more recent version (>= 2016.09.1).
>
> Best,
> Riccardo
>
>


-- 
Rajarshi Guha | http://blog.rguha.net
NIH Center for Advancing Translational Science


Re: [Rdkit-discuss] Postgres query performance

2017-12-08 Thread Rajarshi Guha
Thanks Greg.

Right now I have

rdkit             2017.09.2.0  py27h080088d_1  rdkit
rdkit-postgresql  2016.03.3    py27_1          rdkit

But when I try to install the extension I get an error

chembl23=# create extension if not exists rdkit;

ERROR:  could not load library
"/home/rdkit/.conda/envs/ifx-rdkit-env/lib/postgresql/rdkit.so":
libAvalonLib.so.1: cannot open shared object file: No such file or directory

STATEMENT:  create extension if not exists rdkit;

ERROR:  could not load library
"/home/rdkit/.conda/envs/ifx-rdkit-env/lib/postgresql/rdkit.so":
libAvalonLib.so.1: cannot open shared object file: No such file or directory

Would this be related to the mismatch between the rdkit and rdkit-postgres
versions?

If so, what is the recommended way to get postgres to use the 2017 version
of rdkit?

On Fri, Dec 8, 2017 at 10:07 AM, Greg Landrum 
wrote:

> At some point, in the not too distant future (I hope), there will be an
> updated version of the cartridge.
>
> On Fri, Dec 8, 2017 at 3:55 PM, Rajarshi Guha 
> wrote:
>
>> Thanks Riccardo.
>>
>> While I can see Rdkit 2017.09.2.0 is available on conda (
>> https://anaconda.org/search?q=rdkit), rdkit-postgres seems to be stuck
>> at 2016.03.4
>>
>> Can I update the rdkit installation and use it to update the Postgres
>> extension? Or must I wait for a rdkit-postgres conda with the right version
>> of Rdkit?
>>
>> On Thu, Nov 30, 2017 at 6:03 PM, Riccardo Vianello <
>> riccardo.viane...@gmail.com> wrote:
>>
>>> Hi Rajarshi,
>>>
>>> On Thu, Nov 30, 2017 at 10:31 PM, Rajarshi Guha wrote:
>>>
>>>> Hi, I'm working on an application that is trying to do arbitrary
>>>> substructure queries across ChEMBL 23. I pretty much followed the
>>>> instructions at http://www.rdkit.org/docs/Cartridge.html on a Postgres
>>>> 9.2 instance with RDKit 2016.03.1 all running on a Linux box with 88GB of
>>>> RAM.
>>>>
>>>
>>> some 2016 releases were affected by a bug in the cartridge Makefile that
>>> disabled any compiler optimization. If possible, I would suggest trying a
>>> more recent version (>= 2016.09.1).
>>>
>>> Best,
>>> Riccardo
>>>
>>>
>>
>>
>> --
>> Rajarshi Guha | http://blog.rguha.net
>> NIH Center for Advancing Translational Science
>>
>>
>


-- 
Rajarshi Guha | http://blog.rguha.net
NIH Center for Advancing Translational Science


Re: [Rdkit-discuss] Postgres query performance

2017-12-11 Thread Rajarshi Guha
Thanks a lot - we'll give this a try

On Mon, Dec 11, 2017 at 12:56 AM, Greg Landrum 
wrote:

> This morning I managed to get at least something done. It should now be
> possible to install the cartridge for linux from the rdkit conda channel
> with the command:
> conda install -c rdkit rdkit-postgresql95
>
> caveats:
> - linux only. I built it on centos6, so it should work on just about any
> linux system though
> - requires you to use postgresql95 (also from the rdkit channel)
> - does not have pl/python installed
> - since I had to grab an early train this morning I haven't done much
> testing
>
> This is a long way from being optimal, but it should at least be
> functional. Please let me know if you get a chance to try it out.
>
> I *hope* that this will all get easier when we switch over to using the
> new compiler packages that are part of conda-build v3. I'm trying the first
> experiments with this now.
>
> -greg
>
>
> On Fri, Dec 8, 2017 at 4:07 PM, Greg Landrum 
> wrote:
>
>> At some point, in the not too distant future (I hope), there will be an
>> updated version of the cartridge.
>>
>> On Fri, Dec 8, 2017 at 3:55 PM, Rajarshi Guha 
>> wrote:
>>
>>> Thanks Riccardo.
>>>
>>> While I can see Rdkit 2017.09.2.0 is available on conda (
>>> https://anaconda.org/search?q=rdkit), rdkit-postgres seems to be stuck
>>> at 2016.03.4
>>>
>>> Can I update the rdkit installation and use it to update the Postgres
>>> extension? Or must I wait for a rdkit-postgres conda with the right version
>>> of Rdkit?
>>>
>>> On Thu, Nov 30, 2017 at 6:03 PM, Riccardo Vianello <
>>> riccardo.viane...@gmail.com> wrote:
>>>
>>>> Hi Rajarshi,
>>>>
>>>> On Thu, Nov 30, 2017 at 10:31 PM, Rajarshi Guha <
>>>> rajarshi.g...@gmail.com> wrote:
>>>>
>>>>> Hi, I'm working on an application that is trying to do arbitrary
>>>>> substructure queries across ChEMBL 23. I pretty much followed the
>>>>> instructions at http://www.rdkit.org/docs/Cartridge.html on a
>>>>> Postgres 9.2 instance with RDKit 2016.03.1 all running on a Linux box with
>>>>> 88GB of RAM.
>>>>>
>>>>
>>>> some 2016 releases were affected by a bug in the cartridge Makefile
>>>> that disabled any compiler optimization. If possible, I would suggest
>>>> trying a more recent version (>= 2016.09.1).
>>>>
>>>> Best,
>>>> Riccardo
>>>>
>>>>
>>>
>>>
>>> --
>>> Rajarshi Guha | http://blog.rguha.net
>>> NIH Center for Advancing Translational Science
>>>
>>>
>>
>


-- 
Rajarshi Guha | http://blog.rguha.net
NIH Center for Advancing Translational Science


[Rdkit-discuss] Postgres indexing question

2018-01-05 Thread Rajarshi Guha
Hi, I'm using RDKit 2017.09 with Postgres 9.5. I loaded in ChEMBL23 using
the instructions given at http://www.rdkit.org/docs/Cartridge.html and set
up the indexes etc.

However, when I run a substructure query, the EXPLAIN ANALYZE output looks
like

chembl_23=> explain analyze select count(*) from rdk.mols where m@>'C1CC2CCC3C(CCC4C34)C2C1';

                          QUERY PLAN
--------------------------------------------------------------
 Aggregate  (cost=6564.86..6564.87 rows=1 width=0) (actual time=10546.899..10546.899 rows=1 loops=1)
   ->  Bitmap Heap Scan on mols  (cost=369.80..6560.54 rows=1727 width=0) (actual time=465.881..10539.663 rows=8170 loops=1)
         Recheck Cond: (m @> 'C1CCC2C(C1)CCC1C33CCC21'::mol)
         Rows Removed by Index Recheck: 16131
         Heap Blocks: exact=9458
         ->  Bitmap Index Scan on molidx  (cost=0.00..369.37 rows=1727 width=0) (actual time=460.029..460.029 rows=24301 loops=1)
               Index Cond: (m @> 'C1CCC2C(C1)CCC1C33CCC21'::mol)
 Planning time: 1.258 ms
 Execution time: 10548.293 ms

While it's using the GiST index, I note the big difference in the expected
(1727) and actual (24301) row counts for the Bitmap Index Scan node.

This seems to suggest that the index statistics are not accurate. Has
anybody noticed this? Would fiddling with planner settings (such
as default_statistics_target) be useful for this?

Interestingly, whatever query SMILES I put in, the expected row count for
the Bitmap Index Scan is always 1727. Is this by design?

(The other aspect I noted is that in the subsequent Bitmap Heap Scan, a
large number of rows are discarded. Since the heap pages pointed to by the
Bitmap Index Scan node have to be scanned completely, is it feasible for
structurally similar compounds to be colocated in a heap page? Or is this
beyond the scope of the GiST index?)
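
The row counts in the plan above can be turned into two quick diagnostics: how much of the index output is thrown away by the recheck, and how far off the planner's estimate is. This is a small arithmetic sketch using only the numbers reported in the EXPLAIN ANALYZE output. The variable names are made up for illustration.

```python
# Row counts copied from the EXPLAIN ANALYZE output above.
estimated_rows = 1727      # planner's estimate for the Bitmap Index Scan
index_candidates = 24301   # rows the GiST index actually returned
true_hits = 8170           # rows that survived the exact substructure recheck

# Candidates that passed the fingerprint screen but failed the exact match.
false_positives = index_candidates - true_hits
false_positive_rate = false_positives / index_candidates

# How badly the planner underestimated the size of the index result.
estimate_error = index_candidates / estimated_rows

print(f"removed by recheck: {false_positives}")
print(f"share of index candidates rejected: {false_positive_rate:.1%}")
print(f"actual / estimated index rows: {estimate_error:.1f}x")
```

The first figure matches the "Rows Removed by Index Recheck" line, which confirms that the difference between the two scan nodes is entirely due to fingerprint false positives.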

-- 
Rajarshi Guha | http://blog.rguha.net
NIH Center for Advancing Translational Science


Re: [Rdkit-discuss] Postgres indexing question

2018-01-07 Thread Rajarshi Guha
On Sun, Jan 7, 2018 at 9:47 AM, Greg Landrum  wrote:

> Hi Rajarshi,
>
> On Sat, Jan 6, 2018 at 6:32 AM, Rajarshi Guha 
> wrote:
>
>> Hi, I'm using RDKit 2017.09 with Postgres 9.5. I loaded in ChEMBL23 using
>> the instructions given at http://www.rdkit.org/docs/Cartridge.html and
>> set up the indexes etc.
>>
>> However, when I run a substructure query, the EXPLAIN ANALYZE output looks
>> like
>>
>> chembl_23=> explain analyze select count(*) from rdk.mols
>> where m@>'C1CC2CCC3C(CCC4C34)C2C1';
>>
>> QUERY PLAN
>> --
>>  Aggregate  (cost=6564.86..6564.87 rows=1 width=0) (actual time=10546.899..10546.899 rows=1 loops=1)
>>    ->  Bitmap Heap Scan on mols  (cost=369.80..6560.54 rows=1727 width=0) (actual time=465.881..10539.663 rows=8170 loops=1)
>>          Recheck Cond: (m @> 'C1CCC2C(C1)CCC1C33CCC21'::mol)
>>          Rows Removed by Index Recheck: 16131
>>          Heap Blocks: exact=9458
>>          ->  Bitmap Index Scan on molidx  (cost=0.00..369.37 rows=1727 width=0) (actual time=460.029..460.029 rows=24301 loops=1)
>>                Index Cond: (m @> 'C1CCC2C(C1)CCC1C33CCC21'::mol)
>>  Planning time: 1.258 ms
>>  Execution time: 10548.293 ms
>>
>> While it's using the GiST index, I note the big difference in the
>> expected (1727) and actual (24301) row counts for the Bitmap Index Scan
>> node.
>>
>> This seems to suggest that the index statistics are not accurate. Has
>> anybody noticed this? Would fiddling with planner settings (such
>> as default_statistics_target) be useful for this?
>>
>
> Almost certainly
>

Surprisingly, changing this setting and reindexing the column doesn't
change anything, so it does look like more needs to be done to integrate
with the planner (or at least to provide statistics somehow).
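
One possible reading of the constant 1727, offered as an assumption rather
than something confirmed in this thread: if the @> operator is registered
with a fixed-selectivity estimator (PostgreSQL's built-in contsel returns a
constant 0.001), then the estimate is just that constant times the table's
row count, and neither default_statistics_target nor a reindex would move
it. The sketch below shows the arithmetic. FIXED_SELECTIVITY, the implied
table size and the second query string are assumptions for illustration,
not values taken from the cartridge source or from the plan.

```python
# Assumption: the operator's restriction estimator returns a constant,
# as PostgreSQL's built-in contsel does (0.001). This has not been
# checked against the cartridge's operator definition.
FIXED_SELECTIVITY = 0.001


def estimated_rows(table_rows, selectivity=FIXED_SELECTIVITY):
    """Planner-style row estimate: selectivity * table size, rounded, at least 1."""
    return max(1, round(table_rows * selectivity))


# Table size implied by the observed estimate (an inference, not a measurement).
implied_table_rows = 1727 / FIXED_SELECTIVITY

# Under this assumption the estimate ignores the query molecule entirely.
for query in ("C1CC2CCC3C(CCC4C34)C2C1", "c1ccccc1"):
    print(query, estimated_rows(implied_table_rows))
```

If the assumption holds, a table of roughly 1.7 million rows would produce
this same estimate for every query SMILES.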

Thanks for the pointers


[Rdkit-discuss] postgres cartridge not parsing a SMILES

2018-01-13 Thread Rajarshi Guha
Hi, I'm using RDKit 2017.09 with Postgres 9.5 and a substructure query is
failing when the query SMILES is

C1=CC=C(C=C1)[N]2=CC=CC3=C2C4=C(C=CC(=C4)C5=CC=CN=C5)N=C3

The error reported from Postgres is

PSQLException: ERROR: could not create molecule from SMILES
'C1=CC=C(C=C1)[N]2=CC=CC3=C2C4=C(C=CC(=C4)C5=CC=CN=C5)N=C3'

  Position: 81



The SMILES is parsed by CDK and JChem and I can't see why this should fail.

I must be missing something obvious (?)
-- 
Rajarshi Guha | http://blog.rguha.net
NIH Center for Advancing Translational Science


Re: [Rdkit-discuss] postgres cartridge not parsing a SMILES

2018-01-13 Thread Rajarshi Guha
Ah, I should have checked in the shell first! Thanks for the pointer.

On Sat, Jan 13, 2018 at 5:07 PM, Andrew Dalke 
wrote:

> Hi Rajarshi,
>
> Here's what RDKit says from the interactive shell:
>
> >>> from rdkit import Chem
> >>> Chem.MolFromSmiles("C1=CC=C(C=C1)[N]2=CC=CC3=C2C4=C(C=CC(=C4)C5=CC=CN=C5)N=C3")
> [23:02:36] Explicit valence for atom # 6 N, 4, is greater than permitted
>
> RDKit is pretty strict about accepting chemically reasonable structures,
> and will reject a lot of structures which other programs accept.
>
> This warning about a too-high valence on a nitrogen is probably the most
> common failure message I get from RDKit's SMILES parser.
>
> Cheers,
>
>
> Andrew
> da...@dalkescientific.com
>
>
> > On Jan 13, 2018, at 22:52, Rajarshi Guha wrote:
> >
> > Hi, I'm using RDKit 2017.09 with Postgres 9.5 and a substructure query
> > is failing when the query SMILES is
> >
> > C1=CC=C(C=C1)[N]2=CC=CC3=C2C4=C(C=CC(=C4)C5=CC=CN=C5)N=C3
> >
> > The error reported from Postgres is
> >
> > PSQLException: ERROR: could not create molecule from SMILES
> > 'C1=CC=C(C=C1)[N]2=CC=CC3=C2C4=C(C=CC(=C4)C5=CC=CN=C5)N=C3'
> >   Position: 81
> >
> >
> > The SMILES is parsed by CDK and JChem and I can't see why this should
> > fail.
> >
> > I must be missing something obvious (?)
> > --
> > Rajarshi Guha | http://blog.rguha.net
> > NIH Center for Advancing Translational Science
>
>
>
> 
>
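
The valence complaint can be checked by hand. A simplified sketch of the
bookkeeping, not RDKit's actual implementation: sum the bond orders on
atom 6 as written in the SMILES and compare with the valence usually
permitted for that element and charge. MAX_VALENCE is a hypothetical,
cut-down stand-in for a real valence model.

```python
# Simplified, hypothetical valence table: (element, formal charge) -> max valence.
MAX_VALENCE = {("N", 0): 3, ("N", +1): 4}

# Bonds on atom 6 (the bracketed N) as written in the query SMILES:
# a single bond to the phenyl ring, a double bond to the next ring carbon,
# and a single bond through ring closure 2.
bond_orders_on_n = [1, 2, 1]


def valence_ok(element, charge, bond_orders):
    """Return True if the summed bond orders fit the permitted valence."""
    return sum(bond_orders) <= MAX_VALENCE[(element, charge)]


print(valence_ok("N", 0, bond_orders_on_n))   # neutral [N]: 4 bonds against a limit of 3
print(valence_ok("N", +1, bond_orders_on_n))  # cationic [N+]: 4 bonds against a limit of 4
```

On this reading, writing the atom as [N+] would make the valence legal,
though whether a cation is what the structure is meant to represent is a
chemistry question rather than a parsing one.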



-- 
Rajarshi Guha | http://blog.rguha.net
NIH Center for Advancing Translational Science


[Rdkit-discuss] contrib code not compiling

2018-11-18 Thread Rajarshi Guha
Hi, I checked out the latest RDKit sources from master and I'm trying to
compile the PBF contrib code. However, the compilation fails, reporting
that RDGeneral/export.h is missing:

(rdkit) PBF guha$ make

c++ -O2 -I/usr/local/include -I/Users/guha/src/rdkit/Code -Wno-deprecated
-I/usr/local/include/eigen3 -DUSE_EIGEN2  -c -o demo.o demo.cpp

In file included from demo.cpp:12:

/Users/guha/src/rdkit/Code/RDGeneral/Invariant.h:12:10: fatal error:
'RDGeneral/export.h' file not found

#include <RDGeneral/export.h>

         ^~~~~~~~~~~~~~~~~~~~

1 error generated.

make: *** [demo.o] Error 1

(I haven't compiled RDKit as I already have it installed via conda)

-- 
Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>


Re: [Rdkit-discuss] contrib code not compiling

2018-11-20 Thread Rajarshi Guha
My mistake - I realized that the PBF descriptor is part of the main codebase.

On Tue, Nov 20, 2018 at 5:20 AM Andrew Dalke 
wrote:

> On Nov 19, 2018, at 04:17, Rajarshi Guha wrote:
> > Hi, I checked out the latest RDKit sources from master and I'm trying to
> > compile the PBF.  However, the compilation fails, reporting that
> > RDGeneral/export.h is missing:
>
> While this doesn't answer the question, it seems to be coupled to
>
>   https://github.com/rdkit/rdkit/issues/1903
>
> > (I haven't compiled RDKit as I already have it installed via conda)
>
> It appears that 'export.h' is created during the "cmake" step. That file
> is an 'auto-generated __declspec definition header'.
>
> Andrew
> da...@dalkescientific.com
>
>
>
>
>
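
Since export.h is generated at configure time rather than stored in the
source tree, an installed copy of RDKit may be the other place to look for
it. A small sketch that searches candidate include roots for the header;
the example roots (a source checkout and a conda-style include/rdkit
directory) are assumptions about a typical layout, not paths confirmed in
this thread.

```python
import os
from pathlib import Path

# Header path relative to an include root, as named in the compiler error.
HEADER = Path("RDGeneral") / "export.h"


def find_header(include_roots):
    """Return the include roots that actually contain RDGeneral/export.h."""
    return [Path(root) for root in include_roots if (Path(root) / HEADER).is_file()]


if __name__ == "__main__":
    # Hypothetical candidates: adjust to the local checkout and environment.
    candidates = [Path.home() / "src" / "rdkit" / "Code"]
    conda_prefix = os.environ.get("CONDA_PREFIX")
    if conda_prefix:
        candidates.append(Path(conda_prefix) / "include" / "rdkit")
    # Print each usable root in the form of a compiler include flag.
    for root in find_header(candidates):
        print(f"-I{root}")
```

Any root it prints could be passed to the compiler with -I in place of, or
in addition to, the checkout's Code directory.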


-- 
Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>