Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-28 Thread Maciek Wójcikowski
One small note from me: I would still use another aggregation function instead of sum to get a binary FP: np.reshape(fpa, (4, -1)).any(axis=0) I guess it doesn't change a thing with Tanimoto, but if you try other distances then you can get unexpected results (assuming there are clashes). Pozd
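The difference between the two aggregation functions shows up only when several source bits fold onto the same position. Below is a minimal sketch using NumPy alone; the 16-bit array and the bit positions are made up for illustration and stand in for a real fingerprint converted with ConvertToNumpyArray.

```python
import numpy as np

# Toy 16-bit fingerprint (illustrative values, not a real molecule).
fpa = np.zeros(16, dtype=np.double)
fpa[[1, 5, 9, 14]] = 1.0  # bits 1, 5 and 9 all fold onto column 1

# Fold by a factor of 4: four rows of four bits, combined column-wise.
rows = np.reshape(fpa, (4, -1))

folded_sum = rows.sum(axis=0)  # counts how many source bits hit each column
folded_any = rows.any(axis=0)  # True if at least one source bit hit the column

print(folded_sum)  # column values 0, 3, 1, 0: no longer binary
print(folded_any)  # column values False, True, True, False
```

With sum, column 1 holds the value 3, so a distance that treats the vector as binary may behave unexpectedly; with any, the result stays a boolean vector.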

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-28 Thread Jing Lu
Thanks, Greg, Yes, scikit-learn will automatically promote to arrays of float with the check_array() function. What I am currently doing is fpa = numpy.zeros((len(fp),),numpy.double) DataStructs.ConvertToNumpyArray(fp,fpa) np.sum(np.reshape(fpa, (4, -1)), axis = 0) Is this the same as FoldFingerpr

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-28 Thread Greg Landrum
If that doesn't help (and it may not since some Scikit-Learn functions automatically promote their arguments to arrays of doubles), you can always just generate a shorter fingerprint from the beginning (all the fingerprinting functions take an optional argument for this) or fold the existing finger

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-27 Thread Maciek Wójcikowski
Hi Jing, Most fingerprints are binary and thus can be stored as np.bool_, which compared to double should be 8 times more memory efficient (np.bool_ uses one byte per element; packing the bits, e.g. with np.packbits, would give 64 times). Best, Maciej Pozdrawiam, | Best regards, Maciek Wójcikowski mac...@wojcikowski.pl 2015-08-27 16:15 GMT+02:00 Jing Lu : > Hi Greg, > > Thanks! It work
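The memory footprint of each storage choice can be checked directly. This is a small sketch using NumPy only; the 2048-bit length is an assumed fingerprint size. Note that np.bool_ is stored as one byte per element, so getting down to one bit per fingerprint bit requires packing, for example with np.packbits.

```python
import numpy as np

n_bits = 2048  # assumed fingerprint length

as_double = np.zeros(n_bits, dtype=np.double)  # 8 bytes per bit
as_bool = np.zeros(n_bits, dtype=np.bool_)     # 1 byte per bit
as_packed = np.packbits(as_bool)               # 8 bits per byte (uint8)

print(as_double.nbytes)  # 16384
print(as_bool.nbytes)    # 2048
print(as_packed.nbytes)  # 256
```

Packed arrays have to be unpacked (np.unpackbits) or handled with bitwise operations before they can be passed to code that expects one array element per bit.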

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-27 Thread Jing Lu
Hi Greg, Thanks! It works! But is it possible to fold the fingerprint to a smaller size? np.zeros((100,2048)) still takes a lot of memory... Best, Jing On Wed, Aug 26, 2015 at 11:02 PM, Greg Landrum wrote: > > On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu wrote: > >> >> So, I wonder is there

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-26 Thread Greg Landrum
On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu wrote: > > So, I wonder is there any way to convert fingerprint to a numpy vector? > Indeed there is: In [11]: from rdkit import Chem In [12]: from rdkit import DataStructs In [13]: import numpy In [14]: m =Chem.MolFromSmiles('C1CCC1') In [15]: fp =

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-26 Thread Jing Lu
Sorry to bother you again... Now the most time-consuming part is clustering. Generating the fingerprints takes less than 1h, but the clustering has already been running for more than 30h, and I am not sure when it will finish. Currently, I use scikit-learn's DBSCAN, which has time comp

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Andrew Dalke
On Aug 23, 2015, at 6:38 PM, Jing Lu wrote: > I hope the memory issue won't be a problem. That's up to you and your choice of threshold. > Most AgglomerativeClustering algorithms have O(N^2) time complexity. Will > that be a problem? You have to decide for yourself what counts as a problem.

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Dimitri Maziuk
On 08/23/2015 11:38 AM, Jing Lu wrote: > Thanks, Andrew! > > Yes, I was thinking about using scikit-learn also. But I guess I need to > use a sparse matrix data structure and define a function for > connectivity. I hope the memory issue won't be a problem. > Most AgglomerativeClustering algori

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Jing Lu
Thanks, Takayuki, Both Repeated Bisection clustering and K-means clustering need the number of clusters as input, right? Best, Jing On Sun, Aug 23, 2015 at 1:17 AM, Taka Seri wrote: > Dear Jing, > > How about trying bayon? > https://code.google.com/p/bayon/ > It's no

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Jing Lu
Thanks, Andrew! Yes, I was thinking about using scikit-learn also. But I guess I need to use a sparse matrix data structure and define a function for connectivity. I hope the memory issue won't be a problem. Most AgglomerativeClustering algorithms have O(N^2) time complexity. Will that be a

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Andrew Dalke
On Aug 23, 2015, at 3:43 AM, Jing Lu wrote: > If I want to cluster more than 1M molecules by ECFP4, how could I do it? If I > calculate the distance between every pair of molecules, the distance > matrix will be too big. Does RDKit support any heuristic clustering algorithm > without cal
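To put a number on "too big", here is a back-of-the-envelope calculation in plain Python. It assumes exactly 1,000,000 molecules, a condensed (upper-triangle) distance matrix, and 8-byte doubles per distance; all three are illustrative assumptions rather than figures from the thread.

```python
n = 1_000_000        # assumed number of molecules
bytes_per_value = 8  # assumed storage: one double per pairwise distance

# Number of unordered pairs, i.e. entries in the condensed distance matrix.
pairs = n * (n - 1) // 2
total_bytes = pairs * bytes_per_value

print(pairs)               # 499999500000
print(total_bytes / 1e12)  # roughly 4 (terabytes)
```

Under these assumptions the condensed matrix alone needs about 4 TB, which is why approaches that avoid materializing all pairwise distances come up in this thread.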

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-22 Thread Taka Seri
Dear Jing, How about trying bayon? https://code.google.com/p/bayon/ It's not part of RDKit, but I think the library can cluster molecules using ECFP4. Unfortunately, bayon's input format is not a distance matrix, but the input file is easy to prepare. Best regards. Takayuki 2015年8

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-22 Thread Jing Lu
Currently, I prefer fingerprint-based clustering, because it's hard to set the cutoff for scaffold-based clustering. Does RDKit have scaffold-based clustering? On Sat, Aug 22, 2015 at 10:56 PM, wrote: > Hi, how about scaffold-based clustering? You extract the scaffolds and > then cluster them and

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-22 Thread abhik1368
Hi, how about scaffold-based clustering? You extract the scaffolds, cluster them, and then put each scaffold's compounds into the corresponding cluster. > On Aug 22, 2015, at 8:43 PM, Jing Lu wrote: > > Dear RDKit users, > > If I want to cluster more than 1M molecules