Re: [Rdkit-discuss] RDkit cartridge on Amazon RDS
I followed up with AWS on this, and they stated:

  For the extension to be available we need to test it and white-list the extension. I have put in a feature request for the module to be added in the future. Please note that this may take some time and there is no definite deadline at this point in time.

So maybe this will happen at some point, but don't hold your breath.

Tim

On 12/05/2015 02:53, Greg Landrum wrote:
> Hi Tim,
>
> On Mon, May 11, 2015 at 6:25 PM, Tim Dudgeon <tdudgeon...@gmail.com> wrote:
>> I wondered if anyone has considered trying to get the RDKit cartridge running on Amazon Relational Database Service (RDS)? When using this you don't get a shell on the underlying server, and everything has to be done directly at the database level.
>
> This seems to be the most relevant information:
> http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html#SQLServer.Concepts.General.FeatureSupport
>
> It looks like you would need Amazon to add rdkit to the list of supported extensions. This would be quite cool and, depending on what the underlying operating system is, may be very easy for them to do.
>
> -greg

--
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] Molecular dis / similarity using fingerprints
RDKitters,

I have a partial RDKit / partial methodology question. I hope this email isn't too much of the "how long is a piece of string" nature.

I have a set of molecules (~30,000) for which I would like to get a structural diversity index. So I thought, easy: generate some fingerprint I fancy (ECFP-like, radius 2), take a threshold I fancy (0.7), select a similarity metric I fancy (Tanimoto) and apply these to the set in a pairwise fashion (you can only do this for a small-ish number of molecules). The resulting distribution of Tanimoto scores defines the similarity (or dissimilarity) of the set.

First of all, is there a better way to do this? Does anyone have a feel for the numbers to use (fingerprint type, radius, number of bits)? Is there some 'industry standard'?

Which method should I use, GetMorganFingerprintAsBitVect or GetMorganFingerprint (considering I want ECFP-like fingerprints)? What determines when to use one over the other?

All my scores are rather low even for relatively similar structures, so I think one of my parameters must be off. Just adding (or removing) a carbonyl drops my score to 0.43. I made this notebook example:
http://nbviewer.ipython.org/gist/malteseunderdog/6af446c0dbb1ac9840e7

To the RDKit question: GetMorganFingerprintAsBitVect and GetMorganFingerprint give different Tanimoto scores (with the same radius: 2). This is of course because for the explicit bit vector we can set the length of the vector/fingerprint. Is there an equivalence between the two (say, using n bits gives the same results as GetMorganFingerprint)? How come the GetMorganFingerprint method has no user-defined length for the fingerprint? What are the hashed equivalents of these fingerprints (e.g. GetHashedMorganFingerprint)?

Take care,
JP

PS: A small suggestion, if I am allowed. The fingerprint classes could do with an informative toString (or non-Java equivalent). I know there is ToBitString, but you need to call that explicitly when printing.
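A minimal sketch of the pairwise approach described in the message above, in plain Python so that it runs without RDKit. It assumes each fingerprint is represented as the set of its on-bit indices (with RDKit these could come from the GetOnBits() method of a bit vector); the toy fingerprints, the function names and the convention for two empty fingerprints are illustrative assumptions, not anything taken from the notebook.

```python
from itertools import combinations
from statistics import mean


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def pairwise_similarities(fps):
    """All N*(N-1)/2 pairwise Tanimoto scores for a list of fingerprints."""
    return [tanimoto(a, b) for a, b in combinations(fps, 2)]


def fraction_below(scores, threshold):
    """Fraction of pairs whose similarity falls below the threshold."""
    return sum(s < threshold for s in scores) / len(scores)


# Toy fingerprints (hypothetical on-bit sets, not real molecules)
toy_fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({7, 8, 9}),
]

scores = pairwise_similarities(toy_fps)
mean_similarity = mean(scores)
dissimilar_fraction = fraction_below(scores, 0.7)  # 0.7 is the threshold from the message
```

For N = 30,000 the full comparison is 30,000 x 29,999 / 2, roughly 450 million pairs, so in practice one would either sample pairs or use a bulk similarity routine rather than a Python loop like this one.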
Re: [Rdkit-discuss] Molecular dis / similarity using fingerprints
ECFP, like all fingerprints, is notoriously sensitive to trivial structural changes, so your example is no surprise (the extra carbonyl group also affects aromaticity, depending on your view of aromaticity). But I don't think there is anything simple and fast that's better.

Maybe a clustering approach would work? Something like sphere exclusion clustering, counting the number of clusters at 0.9 - 0.8 similarity? With 30K structures it sounds computationally tractable?

Tim

On 25/05/2015 15:10, JP wrote:
> RDKitters, I have a partial RDKit / partial Methodology question. [...]
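The clustering idea in the reply above could be sketched roughly as follows. This is one simple greedy ("leader"-style) reading of sphere exclusion, written in plain Python under the assumption that fingerprints are sets of on-bit indices; it is not necessarily the exact variant Tim had in mind, and the toy data and function names are hypothetical. The number of cluster centres at a given threshold then serves as the diversity measure.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 1.0


def sphere_exclusion(fps, threshold):
    """Greedy sphere exclusion: a fingerprint becomes a new cluster centre
    only if its similarity to every existing centre is below the threshold.

    Returns the indices of the cluster centres.
    """
    centres = []
    for i, fp in enumerate(fps):
        if all(tanimoto(fp, fps[c]) < threshold for c in centres):
            centres.append(i)
    return centres


# Toy fingerprints (hypothetical on-bit sets, not real molecules)
toy_fps = [
    frozenset(range(20)),        # 0
    frozenset(range(19)),        # 1: similarity 19/20 = 0.95 to fingerprint 0
    frozenset(range(100, 107)),  # 2
    frozenset(range(100, 106)),  # 3: similarity 6/7 (about 0.857) to fingerprint 2
]

centres_at_08 = sphere_exclusion(toy_fps, 0.8)
centres_at_09 = sphere_exclusion(toy_fps, 0.9)
cluster_counts = {0.8: len(centres_at_08), 0.9: len(centres_at_09)}
```

Note that this greedy version depends on the input order, and that a stricter threshold (0.9) yields more clusters than a looser one (0.8). The worst case is still quadratic in the number of molecules, but each fingerprint is only compared against the current centres.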
Re: [Rdkit-discuss] Molecular dis / similarity using fingerprints
Hi JP.

On Mon, May 25, 2015 at 4:10 PM, JP <jeanpaul.ebe...@inhibox.com> wrote:
> I have a set of molecules (~30,000) which I would like to get a structural diversity index for. So I thought easy - generate some fingerprint I fancy (ECFP-like, rad 2), take a threshold I fancy (0.7), select a similarity metric I fancy (Tanimoto) and apply these to the set in a pairwise fashion (you can only do this for a small-ish number of molecules). The resulting distribution of Tanimoto scores defines the similarity (or dissimilarity) of the set.
>
> First of all is there a better way to do this? Does anyone have a feel for the numbers to use (fingerprint type, radius, no of bits)? Is there some 'Industry standard'?
>
> Which method should I use GetMorganFingerprintAsBitVect or GetMorganFingerprint (considering I wanted ECFP like fingerprints)? What determines when to use one over the other?

The two functions use the same algorithm for identifying features in the molecule, but they return different object types. GetMorganFingerprint() returns a sparse int vector 2^32 elements long containing the number of times each feature appears. GetMorganFingerprintAsBitVect() returns a bit vector (standard fingerprint) nBits long (nBits is an argument) that indicates whether or not a particular feature is present. Similarities calculated using the two fingerprints are highly correlated (http://rdkit.blogspot.ch/2013/10/comparing-fingerprints-to-each-other.html), but certainly not identical.

> All my scores are rather low even for relatively similar structures -- so I think one of my parameters must be off. Just adding (or removing) a carbonyl drops my score to 0.43. I made this notebook example:
> http://nbviewer.ipython.org/gist/malteseunderdog/6af446c0dbb1ac9840e7

As Tim pointed out, if you change the aromaticity of a system (which adding/removing the carbonyl does), it can have a dramatic impact on the similarity. That is what's going on here.
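To illustrate why the count-based and bit-based fingerprints described above give correlated but not identical scores, here is a small plain-Python sketch. It assumes a fingerprint can be modelled as a mapping from feature id to occurrence count (for an RDKit sparse int vector, GetNonzeroElements() returns such a dictionary) and that the count-based Tanimoto is the sum of the minimum counts divided by the sum of the maximum counts. The feature ids and counts are made up.

```python
from collections import Counter


def tanimoto_bits(a, b):
    """Presence/absence Tanimoto: only whether a feature occurs matters."""
    sa, sb = set(a), set(b)
    union = len(sa | sb)
    return len(sa & sb) / union if union else 1.0


def tanimoto_counts(a, b):
    """Count-based Tanimoto: sum of min counts over sum of max counts."""
    keys = set(a) | set(b)
    num = sum(min(a[k], b[k]) for k in keys)  # Counter returns 0 for missing keys
    den = sum(max(a[k], b[k]) for k in keys)
    return num / den if den else 1.0


# Hypothetical feature id -> number of occurrences in the molecule
mol_a = Counter({101: 3, 202: 1, 303: 1})
mol_b = Counter({101: 1, 202: 1, 404: 1})

bit_score = tanimoto_bits(mol_a, mol_b)      # 2 shared features / 4 distinct features
count_score = tanimoto_counts(mol_a, mol_b)  # (1 + 1) / (3 + 1 + 1 + 1)
```

Feature 101 occurs three times in one molecule and once in the other; the bit-based score ignores that difference, while the count-based score is penalised by it.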
> To the RDKit question: GetMorganFingerprintAsBitVect and GetMorganFingerprint give different tanimoto scores (with same radius: 2). This is of course because for the explicit bit vector we can set the length of the vector/fingerprint. Is there an equivalence between the two? (say using n bits gives same results as GetMorganFingerprint). How come the GetMorganFingerprint method has no user-defined length for the fingerprint? What are the hashed equivalents of these fingerprints (e.g. GetHashedMorganFingerprint)?

The other two were explained above; GetHashedMorganFingerprint() returns a count vector of a user-specified length (instead of being 2^32 long).

> ps A small suggestion, if I am allowed. The fingerprint classes could do with an informative toString (or non Java equivalent) - I know there is ToBitString, but you need to call that explicitly when printing

Do you mean that you'd like printing a fingerprint from Python to show something about the value of the fingerprint instead of just what type it is? This would be inconsistent with the rest of the RDKit objects, but revisiting how all of that is done could make sense.

-greg
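A rough illustration of how a sparse fingerprint over a 2^32 feature space relates to a fixed-length one. The sketch assumes that folding is done by taking the feature id modulo the requested length; that is an assumption made for illustration and not a statement about RDKit's exact implementation. The function names and feature ids are hypothetical.

```python
from collections import Counter


def fold_counts(sparse_counts, n_bits):
    """Fold a sparse count vector into a fixed-length count vector.

    Assumption: position = feature_id % n_bits; colliding features add up.
    """
    folded = [0] * n_bits
    for feature_id, count in sparse_counts.items():
        folded[feature_id % n_bits] += count
    return folded


def fold_bits(sparse_counts, n_bits):
    """Fold a sparse count vector into the set of on-bit positions."""
    return {feature_id % n_bits for feature_id in sparse_counts}


# Hypothetical feature ids (all below 2**32) with occurrence counts
sparse = Counter({5: 2, 13: 1, 4_000_000_002: 1})

folded_counts = fold_counts(sparse, 8)  # deliberately tiny length to force a collision
on_bits = fold_bits(sparse, 8)
```

With a length of 8, features 5 and 13 land on the same position, so three distinct features collapse to two on bits. Collisions of this kind are one reason a fixed-length fingerprint need not reproduce the similarities from the unfolded sparse vector, although they become rarer as the length grows.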