[Rdkit-discuss] Difference between ECFP and MorganFingerprint

2015-09-29 Thread Jing Lu
Dear RDKit community, I was treating AllChem.GetMorganFingerprint(m1,2) the same as ECFP4. I am writing a paper for an open source tool, so I need to be very accurate. I have seen one open source implementation for ECFP, which is from CDK. Most researchers are using Pipeline Pilot to calculate

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-28 Thread Jing Lu
fingerprints are binary, thus can be stored as np.bool_, which compared to double should be 8 times more memory efficient (64 times if the bits are also packed, e.g. with np.packbits). Best, Maciej Pozdrawiam, | Best regards, Maciek Wójcikowski mac...@wojcikowski.pl 2015-08-27 16:15 GMT+02:00 Jing Lu ajin...@gmail.com: Hi Greg, Thanks
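
For a sense of scale, here is a rough back-of-the-envelope sketch of the storage cost under three layouts. It assumes 1M molecules and 2048-bit fingerprints (numbers taken from this thread), and it relies on the usual NumPy element sizes: 8 bytes for float64, 1 byte for np.bool_, and 8 fingerprint bits per byte once the bits are packed. The function name and the layout labels are made up for illustration.

```python
# Bytes per fingerprint bit under each storage layout.
# float64 and bool match NumPy's element sizes; packed_bits assumes
# 8 fingerprint bits are packed into each byte.
BYTES_PER_ELEMENT = {"float64": 8.0, "bool": 1.0, "packed_bits": 1.0 / 8.0}


def fingerprint_matrix_bytes(n_mols, n_bits, layout):
    """Return the bytes needed for an n_mols x n_bits fingerprint matrix."""
    return int(n_mols * n_bits * BYTES_PER_ELEMENT[layout])


n_mols, n_bits = 1_000_000, 2048
sizes = {
    layout: fingerprint_matrix_bytes(n_mols, n_bits, layout)
    for layout in BYTES_PER_ELEMENT
}

for layout, size in sizes.items():
    print(f"{layout:12s} {size / 1e9:8.3f} GB")
```

Under these assumptions, going from double to a one-byte boolean saves a factor of 8, and the factor of 64 is only reached once the bits are packed.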

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-27 Thread Jing Lu
Hi Greg, Thanks! It works! But is it possible to fold the fingerprint to a smaller size? np.zeros((100,2048)) still takes a lot of memory... Best, Jing On Wed, Aug 26, 2015 at 11:02 PM, Greg Landrum greg.land...@gmail.com wrote: On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu ajin
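
To illustrate what folding means here, the following is a minimal sketch in plain Python, not RDKit's own implementation. It assumes a fingerprint is given as a set of on-bit indices, and the helper name fold_on_bits is hypothetical. Folding by a factor k maps bit i to i modulo (n_bits / k), so bits that collide are simply OR-ed together.

```python
def fold_on_bits(on_bits, n_bits, fold_factor):
    """Fold a fingerprint given as a set of on-bit indices.

    Returns the folded set of on-bit indices and the folded length.
    """
    if fold_factor < 1 or n_bits % fold_factor != 0:
        raise ValueError("n_bits must be divisible by fold_factor")
    if any(i < 0 or i >= n_bits for i in on_bits):
        raise ValueError("on-bit index out of range")
    folded_size = n_bits // fold_factor
    # Bits that land on the same folded index merge into one on-bit.
    return {i % folded_size for i in on_bits}, folded_size


# Toy fingerprint with four on-bits, folded from 2048 down to 512 bits.
on_bits = {3, 700, 1027, 2047}
folded, size = fold_on_bits(on_bits, n_bits=2048, fold_factor=4)
print(size, sorted(folded))
```

In this toy case bits 3 and 1027 collide after folding, which is the price of the smaller size: memory shrinks by the fold factor, but distinct features can become indistinguishable.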

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Jing Lu
/bayon/ It's not a function of RDKit, but I think the library can cluster molecules using ECFP4. Unfortunately, bayon's input file format is not a distance matrix, but the format is easy to prepare. Best regards. Takayuki On Sun, Aug 23, 2015 at 12:03, Jing Lu ajin...@gmail.com wrote: Currently, I prefer

[Rdkit-discuss] Clustering 1M molecules

2015-08-22 Thread Jing Lu
Dear RDKit users, How can I cluster more than 1M molecules by ECFP4? If I calculate the distance between every pair of molecules, the size of the distance matrix will be too big. Does RDKit support any heuristic clustering algorithm without calculating the distance matrix of the
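
As one illustration of clustering without a full distance matrix, here is a minimal sketch of a single-pass "leader" algorithm in plain Python. This is not an RDKit function and not necessarily what the thread ended up recommending. It assumes fingerprints are given as sets of on-bit indices, and the names tanimoto and leader_cluster are made up for this example.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def leader_cluster(fingerprints, threshold):
    """Assign each fingerprint to the first leader within the threshold.

    Only similarities to the current leaders are computed, so no
    n x n distance matrix is ever stored.
    """
    leaders = []     # indices of the fingerprints that started a cluster
    assignment = []  # cluster id for every fingerprint, in input order
    for idx, fp in enumerate(fingerprints):
        for cluster_id, leader_idx in enumerate(leaders):
            if tanimoto(fp, fingerprints[leader_idx]) >= threshold:
                assignment.append(cluster_id)
                break
        else:
            # No leader is similar enough: this fingerprint starts a cluster.
            leaders.append(idx)
            assignment.append(len(leaders) - 1)
    return leaders, assignment


# Toy data: two obvious groups of on-bit sets.
toy = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 12, 13}, {1, 2, 3, 4, 5}]
leaders, assignment = leader_cluster(toy, threshold=0.5)
print(leaders, assignment)
```

Memory grows with the number of molecules rather than its square, and the work is roughly the number of molecules times the number of clusters. The trade-off is that the result depends on input order and on the chosen threshold.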