Re: [Rdkit-discuss] Optimizing SSS in the RDKit
On Feb 22, 2009, at 12:51 PM, Greg Landrum wrote: BTW: looks like you forgot the attachment. D'oh! # Written by Andrew Dalke da...@dalkescientific.com # Andrew Dalke Scientific, AB, Gothenburg, Sweden # This document has been released to the public domain. # Data set comes CACTVS keys, generated by # gzcat ~/databases/Compound_241*.sdf.gz | # awk 'flg==1{id=$1;flg=0} /PUBCHEM_COMPOUND_CID/ {flg=1} /^AAAD.*==/ {print $1, id}' # cactvs_keys.dat # Result contains lines which look like: # # AAADccB6OAAHAAAwYAABQAAAHgIQCA6hkCIwzoLABACIACXSWAKCCAAhJ0AIiABGb4gPJiPFs5/HOCjk1BHa6AeQQCBAAACACBAAQIAAAQAQIA== 2411 # AAADceB7MQBEAAAwYMEAAACBUAAAHwYQDQqF2CizwIPAAAioAiVSdACCEAFlBxAJiAEAZsgIIDrB35GEIYhglADIyccYiMCOiABCAAAQAIQAAA== 2412 Here are results from running this for different data sets and queries # The known hit comes from cid:24100068 # The non-hit comes from cid:24776000 # This test set (Compound_241*) has 98837 fingerprints. # Trie uses 59 MB, bits use 14 MB # Trie takes 11* time of regular search for a known hit trie creation 8.29011487961 int creation 1.60190391541 trie search time 0.557299137115 int search time 0.0491080284119 int search finds 15 trie search finds 15 Are they the same? True Number of fingerprints: 98837 # Same data set, with a fingerprint which isn't in the data set # Trie is 1.2* the speed of regular search (Note: no hits) trie creation 8.09642100334 int creation 1.41214299202 trie search time 0.0511682033539 int search time 0.0443210601807 int search finds 0 trie search finds 0 Are they the same? True Number of fingerprints: 98837 # Using Compound_241* and Compound_242*; together 198819 fingerprints # Trie uses 105 MB, bits use 28 MB # With a known hit, trie takes 10* the time of regular search trie creation 17.7659249306 int creation 3.26757502556 trie search time 0.725470066071 int search time 0.0877990722656 int search finds 30 trie search finds 30 Are they the same? True Number of fingerprints: 198819 # With a known negative, trie is 20% faster than regular search (NOTE! no hits) trie creation 17.7014229298 int creation 3.35999107361 trie search time 0.0733439922333 int search time 0.0909128189087 int search finds 0 trie search finds 0 Are they the same? True Number of fingerprints: 198819 import time # Decode and grab the first 384 bits, to match Smellie's fp size def decode_cactvs_fp(encoded_fp): return encoded_fp[4:].decode(base64)[:384//8] # The order of the bits is very important. Reverse the bits and the trie search time is very slow # This is because in reverse order most fingerprints have a lot of leading 00s # No branching occurs, so more work during the search. #return encoded_fp[4:].decode(base64)[::-1][:384//8] # convert into a Python integer. This is the fastest way to # do the substructure tests in pure Python. def decode_cactvs_fp_as_int(encoded_fp): return int(encoded_fp[4:].decode(base64)[:384//8].encode(hex), 16) return int(encoded_fp[4:].decode(base64)[::-1][:384//8].encode(hex), 16) # This is the leaf node. All values in the node have the same # fingerprint. The tail is non-empty when there is only one leaf # node under a given entry in a node. This occurs because most leaves # have no siblings (Proof: that would require about 2**(384/8) == # 2**48 == a lot of structures) # In that case, tail is a compact representation of the linear chain # of nodes along with the leaf which would contain the same set of # identifier values. class Leaf(object): # memory optimization __slots__ = [tail, values] def __init__(self, tail): self.tail = tail self.values = set() def dump(self, depth): print *depth, self.tail.encode(hex), for value in sorted(self.values): print value, print # A Node contains up to 2**8 == 256 branches, one for each possible byte value. # The child can be another Node or a Leaf. class Node(object): # memory optimization __slots__ = [children] def __init__(self): self.children = {} def add(self, text, value, offset=0): c = text[offset] x = self.children.get(c) if x is None: self.children[c] = y = Leaf(text[offset+1:]) y.values.add(value) elif isinstance(x, Node): x.add(text, value, offset+1) elif isinstance(x, Leaf): #print compare, repr(x.tail), repr(text[offset+1:]) if x.tail == text[offset+1:]: x.values.add(value) elif not x.tail: assert offset+1 == len(text) x.values.add(value) else: c2 = x.tail[0] x.tail = x.tail[1:] self.children[c] = y = Node() y.children[c2] = x y.add(text, value, offset+1) else: raise
Re: [Rdkit-discuss] Optimizing SSS in the RDKit
On Feb 17, 2009, at 8:30 PM, Andrew Dalke wrote: Thanks for the link. I've read [Andrew Smellie's] paper now. It's very nicely written, clear, with good background, description, and results, ... Andrew Smellie didn't take what I thought would be the obvious next step and use a trie (probably a Patricia trie) to store things. ... Now I want to try it out! :) I made a first pass implementation of my trie idea using Python. It's attached. My quick summary is: fingerprint size of time for size of time for time for size bits bits trie trie trie (no hits) === === === 9883714 MB 0.05s59 MB0.6s 0.04s 19881928 MB 0.09s 105 MB0.7s 0.07s Based on the paper, it seems his C code requires about 160 MB for 200,000 compounds so there's a clear savings due to at least one of the two memory savings techniques I tried. I expect faster performance for the regular based screens if I were to do the test directly in C. I'm guessing a factor of 3. I expect an even better performance increase and reduced memory use were I to rewrite the trie in C. There I expect about 10x-30x performance, which brings them to the same times. But that's a scientific wild-ass guess. There's no real conclusion you can draw from these numbers other than that I've managed to save a lot of memory and that this is an avenue worth considering. Andrew da...@dalkescientific.com
Re: [Rdkit-discuss] Optimizing SSS in the RDKit
On Feb 17, 2009, at 12:40 PM, Greg Landrum wrote: Well, now I'm incredibly behind in all this. I will try to slowly catch up. That'll teach you not to take a vacation. ;) Seriously though, I was writing as I worked, which means there's a lot of verbiage and places where I wasn't clear on things. The last email puts everything together. I've generated a new, larger, testing dataset using the pubchem HTS compounds. I will also post the details on those (hopefully this morning). Cool. I've asked a few people/lists for data sets but no response yet. There's a few I'll try. I don't know Judy trees. Do you have a reference/pointer? Oops, judy array http://judy.sourceforge.net/ http://en.wikipedia.org/wiki/Judy_array and I did a (buggy as it turns out) wrapper at http://www.dalkescientific.com/Python/PyJudy.html when I last looked into substructure fp filters. My idea then and now was to store a mapping from: unique path identifier - sorted list of matching compounds Substructure filtering is the same as generating all paths and finding the intersection of the sorted lists. I think this is called an inverted index. Most paths are rare, so storing all those paths doesn't take much space. I was thinking that a sorted list works better than a hash or normal trie because I could do an N-way merge to find the intersection, rather than a lot of membership tests. But in reflection, the latter may be faster. Looks like more testing will occur. They aren't by any chance connected to the thing presented in Andrew Smellie's recent paper (haven't read it yet)? http://pubs.acs.org/doi/abs/10.1021/ci800325v Not at all. I really need to visit the library soon. Or pay $30 for 24 hour access to ACS, plus unknown price for access to Ullmann's paper. I think it's worth looking into branched paths as well for real substructure searches. People don't query with linear fragments all that often, so it seems like it would be a win. While people don't query with liner fragments, more complex structures contain linear subparts, including crossing paths. My thought was that linear paths are easy to generate and canonicalize, and would give a baseline limit to more sophisticated schemes. Andrew da...@dalkescientific.com
Re: [Rdkit-discuss] Optimizing SSS in the RDKit
On Feb 13, 2009, at 6:41 AM, Greg Landrum wrote: I'm leaving for vacation this morning and have limited time, so I'm going to just attach my test data. The rest I'm really looking forward to spending some time with later. I will have email access while gone, but I won't be as responsive as normal. No worries. I have preliminary conclusions for my approach. Summary: it looks like a small fingerprint (64 bits) can filter 80% of the data set, but linear paths isn't going to get better than that. I used my top 271 bits to encode fingerprints for the queries and targets and found that on average my query fingerprints only screened the data set from 1000 compounds down to 222. That's 78% filtered compared to what Greg found: Experiment 2: reduce fps to 512 bits RDK fingerprints: filter out 441529 (54%) Layered: filter out 710647 (86%) 10-15% faster I'm regenerating my data set to have 512 bits so I can do a more direct comparison. ... Done. Interesting. It doesn't change the result. I still filter out only 78% of the structures. Now I'll try training based on the query structures. That's cheating, but I'm curious ... Also interesting. This is a 186 bit fingerprint, and the average number of compounds which are not filtered is .. 221! (78% filtered) To that I added a few hand-coded patterns: C=C, C#C, *1**1, *1***1, *11, *1*1, *1**1 and got 197 (80% filtered). I tried a smaller, 56 bit fingerprint, which on average does not filter 258 compounds (74% filtered). Add in that handful of extra patterns and it's 222 compounds (78% unfiltered). Code is at http://pastebin.com/m45f6ebd9 The numbers I get are a pretty solid 80% filter rate. I can get that even with a 64 bit fingerprint, which is interesting because that can be stored in a single long int. I can do a bit better during path selection by breaking ties (for paths which are closest to being in 1/2 of the structures in a cluster) based on using the path with the shorter path length. I also tried using 1/3 instead of 1/2 but I don't think there's a big difference. I wasn't rigorous though. The worst queries, btw, are: O = 989 unfiltered CC#C = 988 unfiltered (I have no fingerprint with a #) c1ccoc1 = 986 unfiltered (I don't handle cycles). Note: two occurrences: [H]c1ccoc1 on line 46 [H]c1occc1 on line 299 c1c1 = 975 unfiltered (again, no cycles) Note: two occurrences [H]c1([H])c1 on line 53 [H]c1c1 on line 708 Hmm. There are many duplicates in the queries: (columns are number of compounds not filtered, canonicalized query pattern) 915 c1ccncc1 915 c1ccncc1 915 c1ccncc1 915 c1ccncc1 915 c1ccncc1 918 Cc1c[nH]cn1 918 Cc1c[nH]nn1 918 Cc1c[nH]nn1 925 CC=NN 927 c1c2c([nH]cn2)ncn1 927 c1c2c(nc[nH]2)ncn1 927 c1cncnc1 927 c1cncnc1 927 c1cncnc1 928 N 928 NN 928 NN 928 c1[nH]nnn1 928 c1cnc[nH]1 928 c1cnccn1 928 c1cnn[nH]1 928 c1ncncn1 975 c1c1 975 c1c1 986 c1ccoc1 986 c1ccoc1 988 CC#C 989 O To continue in this path I would need to rerun things so I generate fragments in query space but filter them in target space, using the greedy algorithm to pick fragments which increase the ability to filter. That'll have to be some other day. Andrew da...@dalkescientific.com
Re: [Rdkit-discuss] Optimizing SSS in the RDKit
On Feb 10, 2009, at 3:11 PM, Greg Landrum wrote: It would be quite interesting to use the new Ullmann code as a framework and do an implementation of the VF or VF2 algorithms used in vflib. I was surprised to see that there's a public BGL implementation of Ullman. I've seen mentions that people had in-house versions but that was it. I admit though that I haven't tracked CDL. I see that this Ullman implementation is faster than the VFLib-based one. The CDK people have different experience with their implementations. But you wrote that you had to convert between the BGL and VFLib graphs, which explains at least part of the advantage. To get a set of molecule-like queries, I fragmented those 1000 molecules using RECAP and kept the 823 unique fragments I got. I've wondered if it would be possible to get a set of representative structure queries from PubChem or ChemSpider. But I figured there would be some privacy issues because some people might have inappropriately submitted internal proprietary structures. I know when I did bioinformatics work people did that with their sequences. Bioinformatics isn't as secretive as cheminformatics though. The layered fps are clearly more robust w.r.t. fingerprint size (which makes sense: I only set one bit per path there as opposed to 4 per path for the RDKit fp; a good experiment would be to try the RDKit fps with one bit per path). There ought to be some math that can help out. I tried working it out but came to the conclusion that there must be a large set of well correlated bits. Why? Because if p and q are the probabilities that bits are set in the query and the target, then P(a given bit screens) = P(query bit is on) * P(target bit is off) = p * (1-q) P(query is a subset of the target) = = (1-P(a given bit screens))**(number of on-bits) = (1-p*(1-q))**(N*p) and if p and q were, say, 0.2 then the odds of a random query fingerprint being a subset of a target fingerprint is about 1E-32. While Greg measured about 7%. That leads to one thought - if you were to filter out the top few hundred or so subgraph seeds (the characteristic values you use to see the PRNG), would that make the filters more selective? Handwaving, C and CC occur in nearly everything then those bits aren't very selective. Since there are several hundred subgraphs that can map to a given bit, these common bits obscure more selective tests. Hmmm, what does a correlation matrix plot between the fingerprint bits look like? I haven't gotten my hands dirty, as it were, with this sort of data so I don't have a good intuition for what works well and where limitations are. So what does the community think? Interesting results? Arguments about my testing/benchmarking methodology? Obvious next steps? Suggestions for improving the layered fps so that they're even more discriminating? You compared the two different hash fingerprint types to see if their combination was more effective than one along. What about doing the same with the MACCS keys? The combination of two different types of filters may prove better. Otherwise, I don't have any good suggestions. Andrew da...@dalkescientific.com
Re: [Rdkit-discuss] Optimizing SSS in the RDKit
Would be interesting to take a set of compounds and look at a correlation matrix - maybe one can identify a set of generally discriminating bits that can be used for screening ? Probably not but it could be worth a try ... then memory would go down as well as discriminating power up? Nik Andrew Dalke da...@dalkescientific.com 12.02.2009 14:56 To RDKit Discuss rdkit-discuss@lists.sourceforge.net cc Subject Re: [Rdkit-discuss] Optimizing SSS in the RDKit On Feb 12, 2009, at 8:46 AM, Greg Landrum wrote: I'm either not understanding completely or I disagree. The queries were constructed by fragmenting the molecules I searched through, so I'd expect lots of substructure hits (and a lower screen-out rate that arbitrary queries against arbitrary molecules). Ahh, of course. But I don't think fingerprint screen give, say, 0.001% false rates. I think they are more in line with what you found. But if the bit distributions were really uncorrelated for molecules where one is not a substructure of the other, then I would expect extremely low false positive rates. 2048 bits should give a lot of discrimination power if the bits weren't correlated. That's a good idea to add to the list of things to look into. It's also relatively easy to do because it probably just involves increasing the minimum path length included in fingerprints (at least as a first step). Again, I don't have experience with that, but it means that there's less ability to handle unlikely atom types. Yes, the larger subgraphs will include them. Don't know. Looking at MACCS is a good idea. I'll also put that on the list. Is this list on a wiki? ;) Andrew da...@dalkescientific.com -- Create and Deploy Rich Internet Apps outside the browser with Adobe(R)AIR(TM) software. With Adobe AIR, Ajax developers can use existing skills and code to build responsive, highly engaging applications that combine the power of local resources and data with the reach of the web. Download the Adobe AIR SDK and Ajax docs to start building applications today-http://p.sf.net/sfu/adobe-com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss _ CONFIDENTIALITY NOTICE The information contained in this e-mail message is intended only for the exclusive use of the individual or entity named above and may contain information that is privileged, confidential or exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, or the employee or agent responsible for delivery of the message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify the sender immediately by e-mail and delete the material from any computer. Thank you.
Re: [Rdkit-discuss] Optimizing SSS in the RDKit
On Thu, Feb 12, 2009 at 2:55 PM, Andrew Dalke da...@dalkescientific.com wrote: On Feb 12, 2009, at 8:46 AM, Greg Landrum wrote: I'm either not understanding completely or I disagree. The queries were constructed by fragmenting the molecules I searched through, so I'd expect lots of substructure hits (and a lower screen-out rate that arbitrary queries against arbitrary molecules). Ahh, of course. But I don't think fingerprint screen give, say, 0.001% false rates. I think they are more in line with what you found. But if the bit distributions were really uncorrelated for molecules where one is not a substructure of the other, then I would expect extremely low false positive rates. 2048 bits should give a lot of discrimination power if the bits weren't correlated. Agreed, the bit correlation experiment should be done. That's a good idea to add to the list of things to look into. It's also relatively easy to do because it probably just involves increasing the minimum path length included in fingerprints (at least as a first step). Again, I don't have experience with that, but it means that there's less ability to handle unlikely atom types. Yes, the larger subgraphs will include them. Don't know. I suspect the less common atom types aren't a big concern since the larger subgraphs will include them and any sugraph isomorphism involving them will go very quickly (since most things will be screened out in the atom-atom mapping phase) Looking at MACCS is a good idea. I'll also put that on the list. Is this list on a wiki? ;) Not yet, but I just put up the page for it: http://code.google.com/p/rdkit/wiki/SubstructureSearchOptimization Now I just need to populate it. -greg
[Rdkit-discuss] Optimizing SSS in the RDKit
Dear all Andrew's question about fingerprints hit me at the right time: I had just finished doing some optimization work on the RDKit substructure search machinery (removing the vflib dependency). The details are here: http://code.google.com/p/rdkit/wiki/SubgraphIsomorphismOptimization It would be quite interesting to use the new Ullmann code as a framework and do an implementation of the VF or VF2 algorithms used in vflib. Of course there's no better way to optimize subgraph isomorphism than to avoid it all together, which is where the fingerprints mentioned come in. I'm spending a couple of days home from work (with a cold), so I have some room to explore here a little bit. I put together a sandbox using my 1000 pubchem molecules (they're from the HTS set, so they are all either drug-like or lead-like, whatever that means). To get a set of molecule-like queries, I fragmented those 1000 molecules using RECAP and kept the 823 unique fragments I got. I've been using those 823 molecules to query the full set of 1000 molecules and looking at how many calls to the isomorphism code I can avoid using either the RDKit (daylight-like) fingerprints or the layered fingerprints (out to layer 0x4, beyond that these aren't suitable for SSS). The results look pretty encouraging: I can easily filter out more than 90% of the comparisons via fingerprints without losing anything. There are 823000 (823x1000) possible comparisons with my dataset; using the RDKit fingerprints as a screen I filter out 765534 of them (93%) using the layered fingerprints I filter out 765224 (also 93%). The screening [not even remotely optimized, I'm calculating (AB)==A instead of doing it on the fly and short circuiting when something mismatches] takes about 10 seconds in each case. By default each fingerprint uses 2048 bits. I can shrink this by folding the fingerprints (or generating them shorter in the first place... the end result is the same). That potentially gains speed and certainly saves storage space, but there may be a cost at how discriminating the fingerprints are. Experiment 1: reduce fps to 1024 bits RDK fingerprints: filter out 717356 (87%) Layered: filter out 752948 (91%) No obvious speed improvement Experiment 2: reduce fps to 512 bits RDK fingerprints: filter out 441529 (54%) Layered: filter out 710647 (86%) 10-15% faster The layered fps are clearly more robust w.r.t. fingerprint size (which makes sense: I only set one bit per path there as opposed to 4 per path for the RDKit fp; a good experiment would be to try the RDKit fps with one bit per path). They're also faster to generate (they no longer require a PRNG). I think the screening speed thing is a bit of a red herring at the moment since I'm not doing a smart screen, but there is a real impact on storage space. So what does the community think? Interesting results? Arguments about my testing/benchmarking methodology? Obvious next steps? Suggestions for improving the layered fps so that they're even more discriminating? -greg
Re: [Rdkit-discuss] Optimizing SSS in the RDKit
Hi Greg, just a curiosity ... 765534 vs 76522 is one a subset of the other? If not - would it make sense to test on both? Just a thought. Apart from that I think the setup is reasonable for most applications we will have ... Nik Greg Landrum greg.land...@gmail.com 10.02.2009 15:11 To RDKit Discuss rdkit-discuss@lists.sourceforge.net cc Subject [Rdkit-discuss] Optimizing SSS in the RDKit Dear all Andrew's question about fingerprints hit me at the right time: I had just finished doing some optimization work on the RDKit substructure search machinery (removing the vflib dependency). The details are here: http://code.google.com/p/rdkit/wiki/SubgraphIsomorphismOptimization It would be quite interesting to use the new Ullmann code as a framework and do an implementation of the VF or VF2 algorithms used in vflib. Of course there's no better way to optimize subgraph isomorphism than to avoid it all together, which is where the fingerprints mentioned come in. I'm spending a couple of days home from work (with a cold), so I have some room to explore here a little bit. I put together a sandbox using my 1000 pubchem molecules (they're from the HTS set, so they are all either drug-like or lead-like, whatever that means). To get a set of molecule-like queries, I fragmented those 1000 molecules using RECAP and kept the 823 unique fragments I got. I've been using those 823 molecules to query the full set of 1000 molecules and looking at how many calls to the isomorphism code I can avoid using either the RDKit (daylight-like) fingerprints or the layered fingerprints (out to layer 0x4, beyond that these aren't suitable for SSS). The results look pretty encouraging: I can easily filter out more than 90% of the comparisons via fingerprints without losing anything. There are 823000 (823x1000) possible comparisons with my dataset; using the RDKit fingerprints as a screen I filter out 765534 of them (93%) using the layered fingerprints I filter out 765224 (also 93%). The screening [not even remotely optimized, I'm calculating (AB)==A instead of doing it on the fly and short circuiting when something mismatches] takes about 10 seconds in each case. By default each fingerprint uses 2048 bits. I can shrink this by folding the fingerprints (or generating them shorter in the first place... the end result is the same). That potentially gains speed and certainly saves storage space, but there may be a cost at how discriminating the fingerprints are. Experiment 1: reduce fps to 1024 bits RDK fingerprints: filter out 717356 (87%) Layered: filter out 752948 (91%) No obvious speed improvement Experiment 2: reduce fps to 512 bits RDK fingerprints: filter out 441529 (54%) Layered: filter out 710647 (86%) 10-15% faster The layered fps are clearly more robust w.r.t. fingerprint size (which makes sense: I only set one bit per path there as opposed to 4 per path for the RDKit fp; a good experiment would be to try the RDKit fps with one bit per path). They're also faster to generate (they no longer require a PRNG). I think the screening speed thing is a bit of a red herring at the moment since I'm not doing a smart screen, but there is a real impact on storage space. So what does the community think? Interesting results? Arguments about my testing/benchmarking methodology? Obvious next steps? Suggestions for improving the layered fps so that they're even more discriminating? -greg -- Create and Deploy Rich Internet Apps outside the browser with Adobe(R)AIR(TM) software. With Adobe AIR, Ajax developers can use existing skills and code to build responsive, highly engaging applications that combine the power of local resources and data with the reach of the web. Download the Adobe AIR SDK and Ajax docs to start building applications today-http://p.sf.net/sfu/adobe-com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss _ CONFIDENTIALITY NOTICE The information contained in this e-mail message is intended only for the exclusive use of the individual or entity named above and may contain information that is privileged, confidential or exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, or the employee or agent responsible for delivery of the message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify the sender immediately by e-mail and delete the material from any computer. Thank you.