Re: [Rdkit-discuss] RDkit cartridge on Amazon RDS

2015-05-25 Thread Tim Dudgeon

I followed up with AWS on this, and they stated:


For the extension to be available we need to test it and white-list the 
extension. I have put in a feature request for the module to be added in the 
future. Please note that this may take some time and there is no definite 
deadline at this point in time.


So, maybe this will happen sometime, but don't hold your breath.

Tim

On 12/05/2015 02:53, Greg Landrum wrote:

Hi Tim,

On Mon, May 11, 2015 at 6:25 PM, Tim Dudgeon tdudgeon...@gmail.com wrote:


I wondered if anyone has considered trying to get the RDKit cartridge
running on Amazon Relational Database Service (RDS)? When using this you
don't get a shell on the underlying server and everything has to be done
directly at the database level.
This seems to be the most relevant information:

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html#SQLServer.Concepts.General.FeatureSupport


It looks like you would need Amazon to add rdkit to the list of 
supported extensions. This would be quite cool and, depending on what 
the underlying operating system is, may be very easy for them to do.


-greg



--
One dashboard for servers and applications across Physical-Virtual-Cloud 
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Molecular dis / similarity using fingerprints

2015-05-25 Thread JP
RDKitters,

I have a part-RDKit, part-methodology question.  I hope this email
isn't too much of the "how long is a piece of string" nature.

I have a set of molecules (~30,000) which I would like to get a structural
diversity index for.  So I thought easy - generate some fingerprint I
fancy (ECFP-like, rad 2), take a threshold I fancy (0.7), select a
similarity metric I fancy (Tanimoto) and apply these to the set in a
pairwise fashion (you can only do this for a small-ish number of
molecules).  The resulting distribution of Tanimoto scores defines the
similarity (or dissimilarity) of the set.
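
For illustration, the pairwise procedure described above can be sketched in
plain Python. This is a toy model, not RDKit code: fingerprints are
represented as Python sets of on-bit indices, and the helper names
(tanimoto, pairwise_similarities) and the example data are made up for
this sketch.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def pairwise_similarities(fps):
    """All n*(n-1)/2 pairwise similarities, in combinations() order."""
    return [tanimoto(a, b) for a, b in combinations(fps, 2)]


# Hypothetical toy fingerprints (sets of on-bit indices), not real molecules.
fps = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3}, {7, 8, 9}]
sims = pairwise_similarities(fps)

threshold = 0.7  # the cut-off mentioned in the message
n_similar = sum(s >= threshold for s in sims)
mean_similarity = sum(sims) / len(sims)
print(len(sims), n_similar, round(mean_similarity, 3))
```

The number of pairs grows as n*(n-1)/2, so a set of 30,000 molecules gives
roughly 450 million comparisons, which is why the full pairwise matrix is
usually only practical for smaller sets or samples.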

First of all is there a better way to do this? Does anyone have a feel for
the numbers to use (fingerprint type, radius, no of bits)?  Is there some
'Industry standard'?  Which method should I use
GetMorganFingerprintAsBitVect or GetMorganFingerprint (considering I wanted
ECFP like fingerprints) ?  What determines when to use one over the other?

All my scores are rather low even for relatively similar structures -- so I
think one of my parameters must be off.  Just adding (or removing) a
carbonyl drops my score to 0.43.
I made this notebook example:
http://nbviewer.ipython.org/gist/malteseunderdog/6af446c0dbb1ac9840e7

To the RDKit question: GetMorganFingerprintAsBitVect and
GetMorganFingerprint give different tanimoto scores (with same radius: 2).
This is of course because for the explicit bit vector we can set the length
of the vector/fingerprint.  Is there an equivalence between the two? (say
using n bits gives same results as GetMorganFingerprint).  How come the
GetMorganFingerprint method has no user-defined length for the
fingerprint?  What are the hashed equivalents of these fingerprints (e.g.
GetHashedMorganFingerprint) ?

Take care,
JP

ps A small suggestion, if I am allowed.  The fingerprint classes could do
with an informative toString (or non Java equivalent) - I know there is
ToBitString, but you need to call that explicitly when printing


Re: [Rdkit-discuss] Molecular dis / similarity using fingerprints

2015-05-25 Thread Tim Dudgeon
ECFP, like all fingerprints, is notoriously sensitive to trivial 
structural changes, so your example is no surprise (the extra carbonyl 
group also affects aromaticity, depending on your view of aromaticity). 
But I don't think there is anything simple and fast that's better.
Maybe a clustering approach would work? Something like sphere exclusion 
clustering, counting the number of clusters at 0.9 - 0.8 similarity? 
With 30K structures it sounds computationally tractable.
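
A minimal sketch of the sphere exclusion idea, assuming fingerprints are
held as plain Python sets of on-bit indices. This is not RDKit code; the
function names and the toy data are hypothetical. Each molecule becomes a
new cluster centre only if it is less similar than the threshold to every
centre picked so far, and the number of centres is the cluster count.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def sphere_exclusion(fps, threshold):
    """Return indices of cluster centres picked in input order.

    A fingerprint is skipped (it falls inside an existing sphere) when its
    similarity to any earlier centre is at or above the threshold.
    """
    centres = []
    for i, fp in enumerate(fps):
        if all(tanimoto(fp, fps[c]) < threshold for c in centres):
            centres.append(i)
    return centres


# Hypothetical toy fingerprints: two near-duplicate pairs and one singleton.
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 4, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 12, 13}),
    frozenset({20, 21}),
]

for threshold in (0.9, 0.7):
    centres = sphere_exclusion(fps, threshold)
    print(threshold, len(centres), centres)
```

The result depends on the input order, so shuffling the set or repeating
the run at several thresholds gives a better feel for how stable the
cluster count is as a diversity measure.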


Tim



Re: [Rdkit-discuss] Molecular dis / similarity using fingerprints

2015-05-25 Thread Greg Landrum
Hi JP.


On Mon, May 25, 2015 at 4:10 PM, JP jeanpaul.ebe...@inhibox.com wrote:


 I have a set of molecules (~30,000) which I would like to get a structural
 diversity index for.  So I thought easy - generate some fingerprint I
 fancy (ECFP-like, rad 2), take a threshold I fancy (0.7), select a
 similarity metric I fancy (Tanimoto) and apply these to the set in a
 pairwise fashion (you can only do this for a small-ish number of
 molecules).  The resulting distribution of Tanimoto scores defines the
 similarity (or dissimilarity) of the set.

 First of all is there a better way to do this? Does anyone have a feel for
 the numbers to use (fingerprint type, radius, no of bits)?  Is there some
 'Industry standard'?  Which method should I use
 GetMorganFingerprintAsBitVect or GetMorganFingerprint (considering I wanted
 ECFP like fingerprints) ?  What determines when to use one over the other?


The two functions use the same algorithm for identifying features in the
molecule, but they return different object types. GetMorganFingerprint()
returns a sparse int vector 2^32 elements long containing the counts of the
number of times each feature appears. GetMorganFingerprintAsBitVect()
returns a bit vector (standard fingerprint) nBits long (nBits is an
argument) that indicates whether or not a particular feature is present.

Similarities calculated using the two fingerprints are highly correlated (
http://rdkit.blogspot.ch/2013/10/comparing-fingerprints-to-each-other.html),
but certainly not identical.
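
To make the difference concrete, here is a toy sketch in plain Python (not
RDKit code). A count fingerprint is modelled as a dict mapping a feature
id to its count, and the bit version keeps only which features are
present. The count formula below, sum(min) / (sum(a) + sum(b) - sum(min)),
is one common generalisation of Tanimoto to counts and is an assumption of
this sketch; the function names and example values are made up.

```python
def count_tanimoto(a, b):
    """Tanimoto on count vectors stored as {feature_id: count} dicts."""
    shared = sum(min(n, b[f]) for f, n in a.items() if f in b)
    total = sum(a.values()) + sum(b.values()) - shared
    return shared / total if total else 1.0


def bit_tanimoto(a, b):
    """Tanimoto using only presence/absence of each feature."""
    on_a, on_b = set(a), set(b)
    shared = len(on_a & on_b)
    total = len(on_a) + len(on_b) - shared
    return shared / total if total else 1.0


# Hypothetical feature counts for two molecules.
mol_a = {1: 2, 2: 1, 3: 1}
mol_b = {1: 1, 2: 1, 4: 1}

print(count_tanimoto(mol_a, mol_b))  # 2 shared counts out of 5
print(bit_tanimoto(mol_a, mol_b))    # 2 shared features out of 4
```

The same pair of molecules scores differently because repeated features
carry weight in the count version and are collapsed to a single bit in the
other.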


 All my scores are rather low even for relatively similar structures -- so
 I think one of my parameters must be off.  Just adding (or removing) a
 carbonyl drops my score to 0.43.
 I made this notebook example:
 http://nbviewer.ipython.org/gist/malteseunderdog/6af446c0dbb1ac9840e7


As Tim pointed out, if you change the aromaticity of a system (which
adding/removing the carbonyl does), it can have a dramatic impact on the
similarity. That is what's going on here.


To the RDKit question: GetMorganFingerprintAsBitVect and
 GetMorganFingerprint give different tanimoto scores (with same radius: 2).
 This is of course because for the explicit bit vector we can set the length
 of the vector/fingerprint.  Is there an equivalence between the two? (say
 using n bits gives same results as GetMorganFingerprint).  How come the
 GetMorganFingerprint method has no user-defined length for the
 fingerprint?  What are the hashed equivalents of these fingerprints (e.g.
 GetHashedMorganFingerprint) ?


The other two were explained above; GetHashedMorganFingerprint() returns a
count vector of a user-specified length (instead of being 2^32 long).
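
The relationship between the sparse and fixed-length forms can be pictured
as folding. The sketch below is plain Python, not RDKit code, and it
assumes a simple modulo folding scheme purely for illustration; the
details of how RDKit maps feature ids to positions may differ. The
function names and feature ids are hypothetical.

```python
from collections import Counter


def fold_counts(sparse_counts, n_bits):
    """Fold {feature_id: count} into n_bits positions, summing on collision."""
    folded = Counter()
    for feature_id, count in sparse_counts.items():
        folded[feature_id % n_bits] += count
    return dict(folded)


def fold_bits(sparse_counts, n_bits):
    """Fold feature ids into a set of on-bit positions (presence only)."""
    return {feature_id % n_bits for feature_id in sparse_counts}


# Hypothetical sparse fingerprint: ids 5 and 2053 collide when n_bits is 2048.
sparse = {5: 1, 2053: 2, 7: 1}

print(fold_counts(sparse, 2048))
print(fold_bits(sparse, 2048))
```

With a short vector, distinct features can land on the same position, so
similarities from a folded fingerprint only approach those of the sparse
one as the length grows.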


 ps A small suggestion, if I am allowed.  The fingerprint classes could do
 with an informative toString (or non Java equivalent) - I know there is
 ToBitString, but you need to call that explicitly when printing


Do you mean that you'd like printing a fingerprint from Python to show
something about the value of the fingerprint instead of just what type it
is? This would be inconsistent with the rest of the RDKit objects, but
revisiting how all of that is done could make sense.

-greg