Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-28 Thread Maciek Wójcikowski
One small note from me: I would still use a different aggregate function
instead of sum to get a binary FP:
np.reshape(fpa, (4, -1)).any(axis = 0)
I guess it doesn't change anything with Tanimoto, but if you try other
distances you can get unexpected results (assuming there are clashes, i.e.
several bits folding onto the same position).
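
To illustrate the difference between the two aggregations, here is a minimal sketch on a made-up 16-bit vector (not a real fingerprint); the array contents and the fold factor of 4 are assumptions chosen so that two bits land on the same folded position:

```python
import numpy as np

# Toy 16-bit "fingerprint" as a 0/1 array. Bits 1 and 5 end up on the
# same position when the vector is folded to a quarter of its length.
fpa = np.zeros(16, dtype=np.double)
fpa[[1, 5, 10]] = 1

blocks = np.reshape(fpa, (4, -1))   # 4 consecutive blocks of 4 bits each

folded_sum = blocks.sum(axis=0)     # counts: a clash gives a value of 2
folded_any = blocks.any(axis=0)     # booleans: a clash still gives True

print(folded_sum)                   # [0. 2. 1. 0.]
print(folded_any.astype(int))       # [0 1 1 0]
```

With sum the folded vector is no longer binary wherever bits clash, which is what can surprise distance functions that expect 0/1 input.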


Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl

2015-08-28 17:17 GMT+02:00 Jing Lu :

> Thanks, Greg,
>
> Yes, sciket learn will automatically promote to arrays of float with 
> check_array()
> function. What I am currently doing is
>
>
> fpa = numpy.zeros((len(fp),),numpy.double)
> DataStructs.ConvertToNumpyArray(fp,fpa)
> np.sum(np.reshape(fpa, (4, -1)), axis = 0)
>
>
> Is this the same as FoldFingerprint()?
>
>
> Best,Jing
>
>
>
> On Fri, Aug 28, 2015 at 5:03 AM, Greg Landrum 
> wrote:
>
>> If that doesn't help (and it may not since some Scikit-Learn functions
>> automatically promote their arguments to arrays of doubles), you can always
>> just generate a shorter fingerprint from the beginning (all the
>> fingerprinting functions take an optional argument for this) or fold the
>> existing fingerprints to a new size using the function
>> rdkit.DataStructs.FoldFingerprint().
>>
>> Best,
>> -greg
>>
>>
>> On Thu, Aug 27, 2015 at 4:33 PM, Maciek Wójcikowski <
>> mac...@wojcikowski.pl> wrote:
>>
>>> Hi Jing,
>>>
>>> Most fingerprints are binary, thus can be stored as np.bool_, which
>>> compared to double should be 64 times more memory efficient.
>>>
>>> Best,
>>> Maciej
>>>
>>> 
>>> Pozdrawiam,  |  Best regards,
>>> Maciek Wójcikowski
>>> mac...@wojcikowski.pl
>>>
>>> 2015-08-27 16:15 GMT+02:00 Jing Lu :
>>>
>>>> Hi Greg,
>>>>
>>>> Thanks! It works! But, is that possible to fold the fingerprint to
>>>> smaller size? np.zeros((100,2048)) still takes a lot of memory...
>>>>
>>>>
>>>> Best,
>>>> Jing
>>>>
>>>> On Wed, Aug 26, 2015 at 11:02 PM, Greg Landrum
>>>> wrote:
>>>>
>>>>>
>>>>> On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu wrote:
>>>>>
>>>>>>
>>>>>> So, I wonder is there any way to convert fingerprint to a numpy
>>>>>> vector?
>>>>>>
>>>>>
>>>>> Indeed there is:
>>>>>
>>>>> In [11]: from rdkit import Chem
>>>>>
>>>>> In [12]: from rdkit import DataStructs
>>>>>
>>>>> In [13]: import numpy
>>>>>
>>>>> In [14]: m =Chem.MolFromSmiles('C1CCC1')
>>>>>
>>>>> In [15]: fp = Chem.RDKFingerprint(m)
>>>>>
>>>>> In [16]: fpa = numpy.zeros((len(fp),),numpy.double)
>>>>>
>>>>> In [17]: DataStructs.ConvertToNumpyArray(fp,fpa)
>>>>>
>>>>>
>>>>> Best,
>>>>> -greg
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
>
--
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-28 Thread Jing Lu
Thanks, Greg,

Yes, scikit-learn will automatically promote the input to arrays of floats
with its check_array() function. What I am currently doing is


fpa = numpy.zeros((len(fp),),numpy.double)
DataStructs.ConvertToNumpyArray(fp,fpa)
numpy.sum(numpy.reshape(fpa, (4, -1)), axis = 0)


Is this the same as FoldFingerprint()?


Best,
Jing



On Fri, Aug 28, 2015 at 5:03 AM, Greg Landrum 
wrote:

> If that doesn't help (and it may not since some Scikit-Learn functions
> automatically promote their arguments to arrays of doubles), you can always
> just generate a shorter fingerprint from the beginning (all the
> fingerprinting functions take an optional argument for this) or fold the
> existing fingerprints to a new size using the function
> rdkit.DataStructs.FoldFingerprint().
>
> Best,
> -greg
>
>
> On Thu, Aug 27, 2015 at 4:33 PM, Maciek Wójcikowski wrote:
>
>> Hi Jing,
>>
>> Most fingerprints are binary, thus can be stored as np.bool_, which
>> compared to double should be 64 times more memory efficient.
>>
>> Best,
>> Maciej
>>
>> 
>> Pozdrawiam,  |  Best regards,
>> Maciek Wójcikowski
>> mac...@wojcikowski.pl
>>
>> 2015-08-27 16:15 GMT+02:00 Jing Lu :
>>
>>> Hi Greg,
>>>
>>> Thanks! It works! But, is that possible to fold the fingerprint to
>>> smaller size? np.zeros((100,2048)) still takes a lot of memory...
>>>
>>>
>>> Best,
>>> Jing
>>>
>>> On Wed, Aug 26, 2015 at 11:02 PM, Greg Landrum 
>>> wrote:
>>>

>>>> On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu wrote:
>>>>
>>>>>
>>>>> So, I wonder is there any way to convert fingerprint to a numpy vector?
>>>>>
>>>>
>>>> Indeed there is:
>>>>
>>>> In [11]: from rdkit import Chem
>>>>
>>>> In [12]: from rdkit import DataStructs
>>>>
>>>> In [13]: import numpy
>>>>
>>>> In [14]: m =Chem.MolFromSmiles('C1CCC1')
>>>>
>>>> In [15]: fp = Chem.RDKFingerprint(m)
>>>>
>>>> In [16]: fpa = numpy.zeros((len(fp),),numpy.double)
>>>>
>>>> In [17]: DataStructs.ConvertToNumpyArray(fp,fpa)
>>>>
>>>>
>>>> Best,
>>>> -greg


>>>
>>>
>>
>


Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-28 Thread Greg Landrum
If that doesn't help (and it may not since some Scikit-Learn functions
automatically promote their arguments to arrays of doubles), you can always
just generate a shorter fingerprint from the beginning (all the
fingerprinting functions take an optional argument for this) or fold the
existing fingerprints to a new size using the function
rdkit.DataStructs.FoldFingerprint().

Best,
-greg


On Thu, Aug 27, 2015 at 4:33 PM, Maciek Wójcikowski 
wrote:

> Hi Jing,
>
> Most fingerprints are binary, thus can be stored as np.bool_, which
> compared to double should be 64 times more memory efficient.
>
> Best,
> Maciej
>
> 
> Pozdrawiam,  |  Best regards,
> Maciek Wójcikowski
> mac...@wojcikowski.pl
>
> 2015-08-27 16:15 GMT+02:00 Jing Lu :
>
>> Hi Greg,
>>
>> Thanks! It works! But, is that possible to fold the fingerprint to
>> smaller size? np.zeros((100,2048)) still takes a lot of memory...
>>
>>
>> Best,
>> Jing
>>
>> On Wed, Aug 26, 2015 at 11:02 PM, Greg Landrum 
>> wrote:
>>
>>>
>>> On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu  wrote:
>>>

>>>> So, I wonder is there any way to convert fingerprint to a numpy vector?

>>>
>>> Indeed there is:
>>>
>>> In [11]: from rdkit import Chem
>>>
>>> In [12]: from rdkit import DataStructs
>>>
>>> In [13]: import numpy
>>>
>>> In [14]: m =Chem.MolFromSmiles('C1CCC1')
>>>
>>> In [15]: fp = Chem.RDKFingerprint(m)
>>>
>>> In [16]: fpa = numpy.zeros((len(fp),),numpy.double)
>>>
>>> In [17]: DataStructs.ConvertToNumpyArray(fp,fpa)
>>>
>>>
>>> Best,
>>> -greg
>>>
>>>
>>
>>
>


Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-27 Thread Maciek Wójcikowski
Hi Jing,

Most fingerprints are binary, so they can be stored as np.bool_, which
uses one byte per bit and is therefore 8 times more memory efficient than
double (64 times if the bits are also packed, e.g. with np.packbits).
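
To put numbers on the storage options, here is a small NumPy sketch comparing the size of one fingerprint stored as double, as np.bool_, and as packed bits. The 2048-bit length is an assumption taken from the array shape mentioned elsewhere in the thread:

```python
import numpy as np

n_bits = 2048  # assumed fingerprint length

as_double = np.zeros(n_bits, dtype=np.double)  # 8 bytes per bit
as_bool = np.zeros(n_bits, dtype=np.bool_)     # 1 byte per bit
as_packed = np.packbits(as_bool)               # 8 bits per byte (uint8)

print(as_double.nbytes)  # 16384
print(as_bool.nbytes)    # 2048
print(as_packed.nbytes)  # 256
```

Packed bits have to be unpacked again (np.unpackbits) before most array operations, so the smaller footprint trades off against convenience.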

Best,
Maciej


Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl

2015-08-27 16:15 GMT+02:00 Jing Lu :

> Hi Greg,
>
> Thanks! It works! But, is that possible to fold the fingerprint to smaller
> size? np.zeros((100,2048)) still takes a lot of memory...
>
>
> Best,
> Jing
>
> On Wed, Aug 26, 2015 at 11:02 PM, Greg Landrum 
> wrote:
>
>>
>> On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu  wrote:
>>
>>>
>>> So, I wonder is there any way to convert fingerprint to a numpy vector?
>>>
>>
>> Indeed there is:
>>
>> In [11]: from rdkit import Chem
>>
>> In [12]: from rdkit import DataStructs
>>
>> In [13]: import numpy
>>
>> In [14]: m =Chem.MolFromSmiles('C1CCC1')
>>
>> In [15]: fp = Chem.RDKFingerprint(m)
>>
>> In [16]: fpa = numpy.zeros((len(fp),),numpy.double)
>>
>> In [17]: DataStructs.ConvertToNumpyArray(fp,fpa)
>>
>>
>> Best,
>> -greg
>>
>>
>
>


Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-27 Thread Jing Lu
Hi Greg,

Thanks! It works! But is it possible to fold the fingerprint to a smaller
size? np.zeros((100,2048)) still takes a lot of memory...


Best,
Jing

On Wed, Aug 26, 2015 at 11:02 PM, Greg Landrum 
wrote:

>
> On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu  wrote:
>
>>
>> So, I wonder is there any way to convert fingerprint to a numpy vector?
>>
>
> Indeed there is:
>
> In [11]: from rdkit import Chem
>
> In [12]: from rdkit import DataStructs
>
> In [13]: import numpy
>
> In [14]: m =Chem.MolFromSmiles('C1CCC1')
>
> In [15]: fp = Chem.RDKFingerprint(m)
>
> In [16]: fpa = numpy.zeros((len(fp),),numpy.double)
>
> In [17]: DataStructs.ConvertToNumpyArray(fp,fpa)
>
>
> Best,
> -greg
>
>


Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-26 Thread Greg Landrum
On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu  wrote:

>
> So, I wonder is there any way to convert fingerprint to a numpy vector?
>

Indeed there is:

In [11]: from rdkit import Chem

In [12]: from rdkit import DataStructs

In [13]: import numpy

In [14]: m = Chem.MolFromSmiles('C1CCC1')

In [15]: fp = Chem.RDKFingerprint(m)

In [16]: fpa = numpy.zeros((len(fp),),numpy.double)

In [17]: DataStructs.ConvertToNumpyArray(fp,fpa)


Best,
-greg


Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-26 Thread Jing Lu
Sorry to bother again...

Now the most time-consuming part is clustering. Generating the
fingerprints takes less than 1 h, but the clustering has
already taken more than 30 h, and I am not sure when it will finish.

Currently I use scikit-learn's DBSCAN, which has time complexity O(n log(n)).
A more efficient clustering algorithm is MiniBatchKMeans, but
MiniBatchKMeans only takes a matrix as input.

So, I wonder is there any way to convert fingerprint to a numpy vector?


Thanks,
Jing


On Sun, Aug 23, 2015 at 5:07 PM, Andrew Dalke 
wrote:

> On Aug 23, 2015, at 6:38 PM, Jing Lu wrote:
> > I hope the memory issue won't be a problem.
>
> That's up to you and your choice of threshold.
>
> >  Most AgglomerativeClustering algorithms have time complexity with N^2.
> Will that be a problem?
>
> You have to decided for yourself what counts as a problem. If you want to
> get it done in 1 minute with a threshold of 0.2, then you've got a problem.
> If you're willing to take a month, then there's no problem.
>
> With chemfp, Taylor-Butina clustering, at
> http://chemfp.readthedocs.org/en/latest/using-api.html#taylor-butina-clustering
> , took 35 seconds for 100,000 fingers. The NxN calculation is also N^2
> time, so should only take about an hour for 1 million fingerprints.
>
> Best of course is to start with a smaller system first, see if it works,
> and only then try to scale up. Then you'll have experience of which methods
> are appropriate and what your time constraints are.
>
>
> Andrew
> da...@dalkescientific.com
>
>
>
>


Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Andrew Dalke
On Aug 23, 2015, at 6:38 PM, Jing Lu wrote:
> I hope the memory issue won't be a problem.

That's up to you and your choice of threshold.

>  Most AgglomerativeClustering algorithms have time complexity with N^2. Will 
> that be a problem?

You have to decide for yourself what counts as a problem. If you want to get 
it done in 1 minute with a threshold of 0.2, then you've got a problem. If 
you're willing to take a month, then there's no problem.

With chemfp, Taylor-Butina clustering, at 
http://chemfp.readthedocs.org/en/latest/using-api.html#taylor-butina-clustering 
, took 35 seconds for 100,000 fingerprints. The NxN calculation is also N^2 
time, so should only take about an hour for 1 million fingerprints.
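
The extrapolation above is plain quadratic scaling. As a worked sketch, assuming the run time grows exactly with N^2 and nothing else changes (memory pressure, threshold, hardware):

```python
# Reported timing: 35 seconds for 100,000 fingerprints.
t_small = 35.0        # seconds
n_small = 100_000
n_large = 1_000_000

# Quadratic scaling: 10 times more fingerprints means 100 times more pairs.
t_large = t_small * (n_large / n_small) ** 2

print(t_large)                  # 3500.0 seconds
print(round(t_large / 60, 1))   # 58.3 minutes
```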

Best of course is to start with a smaller system first, see if it works, and 
only then try to scale up. Then you'll have experience of which methods are 
appropriate and what your time constraints are.


Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Dimitri Maziuk
On 08/23/2015 11:38 AM, Jing Lu wrote:
> Thanks, Andrew!
> 
> Yes, I was thinking about using scikit-learn also. But I guess I need to
> use a data structure for sparse matrix and define a function for
> connectivity. I hope the memory issue won't be a problem.
> Most AgglomerativeClustering algorithms have time complexity with N^2. Will
> that be a problem?

Usual programming solutions are
- if you don't need the whole matrix in RAM at once, cache it to disk.
Otherwise try to split the job into smaller batches.
- Big-Oh notation is relative complexity. In absolute terms, if it
finishes overnight and you only intend to run it a handful of times, N^2
is not worth worrying about. Otherwise try to split into smaller batches
that you can run in parallel on a cluster of computers.
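
As a rough sketch of the batching idea, the following computes Tanimoto similarities one block of rows at a time, so only a batch_size x N slice of the matrix is in memory at once. The function names, the toy 4-bit fingerprints, and the threshold are all made up for illustration; this is not how any particular package does it:

```python
import numpy as np

def tanimoto_block(block, all_fps):
    """Tanimoto similarity of each row of `block` against each row of `all_fps`."""
    common = block.astype(np.int32) @ all_fps.astype(np.int32).T  # shared on-bits
    union = block.sum(axis=1)[:, None] + all_fps.sum(axis=1)[None, :] - common
    sims = np.zeros(common.shape, dtype=np.double)
    # Leave 0.0 where the union is empty (two all-zero fingerprints).
    np.divide(common, union, out=sims, where=union > 0)
    return sims

def neighbors_in_batches(fps, threshold, batch_size):
    """Yield (i, j, similarity) for i < j, one block of rows at a time."""
    for start in range(0, len(fps), batch_size):
        sims = tanimoto_block(fps[start:start + batch_size], fps)
        for row, col in zip(*np.nonzero(sims >= threshold)):
            i = start + row
            if i < col:  # report each pair once, skip the diagonal
                yield int(i), int(col), float(sims[row, col])

# Four toy fingerprints, one per row.
fps = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 0, 1, 1],
                [0, 0, 0, 1]], dtype=bool)

pairs = list(neighbors_in_batches(fps, threshold=0.5, batch_size=2))
print(pairs)
```

Each batch is independent of the others, so the loop over `start` is also the natural place to hand work out to several processes or machines.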

FWIW
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Jing Lu
Thanks, Takayuki,

Both repeated bisection clustering and k-means clustering need the number
of clusters as input, right?


Best,
Jing

On Sun, Aug 23, 2015 at 1:17 AM, Taka Seri  wrote:

> Dear Jing,
>
> How about your trying using bayon ?
> https://code.google.com/p/bayon/
> It's not function of RDKit, but I think the library can cluster molecules
> using ECFP4.
>
> Unfortunately, input file format of bayon is not distance matrix but easy
> to prepare the format.
>
> Best regards.
>
> Takayuki
>
>
> On Sun, Aug 23, 2015 at 12:03, Jing Lu wrote:
>
>> Currently, I prefer fingerprint based clustering, because it's hard to
>> set the cutoff for scaffold based clustering. Does RDKit have scaffold
>> based clustering?
>>
>> On Sat, Aug 22, 2015 at 10:56 PM,  wrote:
>>
>>> Hi, how about scaffold based clustering . You extract the scaffolds and
>>> then cluster it and then put the respective scaffold compounds inside the
>>> cluster .
>>>
>>> Sent from my iPhone
>>>
>>> > On Aug 22, 2015, at 8:43 PM, Jing Lu  wrote:
>>> >
>>> > Dear RDKit users,
>>> >
>>> > If I want to cluster more than 1M molecules by ECFP4. How could I do
>>> it? If I calculate the distance between every pair of molecules, the size
>>> of distance matrix will be too big. Does RDKit support any heuristic
>>> clustering algorithm without calculating the distance matrix of the whole
>>> library?
>>> >
>>> >
>>> >
>>> > Thanks,
>>> > Jing
>>> >
>>>
>>
>>
>


Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Jing Lu
Thanks, Andrew!

Yes, I was also thinking about using scikit-learn. But I guess I need to
use a sparse matrix data structure and define a function for
connectivity. I hope the memory issue won't be a problem.
Most agglomerative clustering algorithms have O(N^2) time complexity. Will
that be a problem?



Best,
Jing

On Sun, Aug 23, 2015 at 3:13 AM, Andrew Dalke 
wrote:

> On Aug 23, 2015, at 3:43 AM, Jing Lu wrote:
> > If I want to cluster more than 1M molecules by ECFP4. How could I do it?
> If I calculate the distance between every pair of molecules, the size of
> distance matrix will be too big. Does RDKit support any heuristic
> clustering algorithm without calculating the distance matrix of the whole
> library?
>
> You should look to a third-party package, like scikit-learn from
> http://scikit-learn.org/ , for clustering. That has a very extensive set
> of clustering algorithms, including k-means at
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
> .
>
> Though you may be interested in the note on that page: "For large scale
> learning (say n_samples > 10k) MiniBatchKMeans is probably much faster to
> than the default batch implementation."
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
>
> My memory is that k-means depends on a Euclidean distance between the
> records, which is different from the usual Tanimoto (or metric-like
> 1-Tanimoto) in cheminformatics.
>
> If you would rather use Tanimoto, then perhaps try a method like
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
> ?
>
> If you go that route, and you want to build the full 1M x 1M distance
> matrix, the usual approach is to ignore similarities below a given
> threshold T (e.g., T<0.8). This can be thought of as either setting those
> entries to 0.0, or specifying an "ignore" flag. In either case, the result
> can be stored in a sparse matrix, which is efficient at storing only the
> data of interest.
>
> Using my package, chemfp, from http://chemfp.com/ you can compute a
> sparse matrix for 1M x 1M fingerprints in about an hour using a laptop or
> desktop.
>
> The question would then be how to adapt the parse output format from
> chemfp to the sparse input format for your clustering method of choice.
>
> Best regards,
>
> Andrew
> da...@dalkescientific.com
>
>
>
>


Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Andrew Dalke
On Aug 23, 2015, at 3:43 AM, Jing Lu wrote:
> If I want to cluster more than 1M molecules by ECFP4. How could I do it? If I 
> calculate the distance between every pair of molecules, the size of distance 
> matrix will be too big. Does RDKit support any heuristic clustering algorithm 
> without calculating the distance matrix of the whole library?

You should look to a third-party package, like scikit-learn from 
http://scikit-learn.org/ , for clustering. That has a very extensive set of 
clustering algorithms, including k-means at 
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html .

Though you may be interested in the note on that page: "For large scale 
learning (say n_samples > 10k) MiniBatchKMeans is probably much faster to than 
the default batch implementation." 
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

My memory is that k-means depends on a Euclidean distance between the records, 
which is different from the usual Tanimoto (or metric-like 1-Tanimoto) in 
cheminformatics. 
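
To make the difference concrete, here is a small sketch with made-up fingerprints, written as sets of on-bit indices, where Tanimoto and Euclidean distance disagree about which of two candidates is closer to a query. The helper names and bit patterns are purely illustrative:

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0

def squared_euclidean(a, b):
    """Squared Euclidean distance between the corresponding 0/1 vectors."""
    # Every bit set in exactly one of the two fingerprints contributes 1.
    return len(a ^ b)

query = set(range(0, 20))    # 20 on-bits
x = set(range(6, 26))        # 20 on-bits, 14 shared with the query
y = set(range(0, 10))        # 10 on-bits, all shared with the query

print(tanimoto(query, x), tanimoto(query, y))                    # x is more similar
print(squared_euclidean(query, x), squared_euclidean(query, y))  # y is nearer
```

Here Tanimoto prefers x (14/26 against 10/20) while the Euclidean distance prefers y (10 against 12), so a Euclidean-based method can group fingerprints differently from a Tanimoto-based one.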

If you would rather use Tanimoto, then perhaps try a method like 
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
 ?

If you go that route, and you want to build the full 1M x 1M distance matrix, 
the usual approach is to ignore similarities below a given threshold T (e.g., 
T<0.8). This can be thought of as either setting those entries to 0.0, or 
specifying an "ignore" flag. In either case, the result can be stored in a 
sparse matrix, which is efficient at storing only the data of interest.
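
A minimal sketch of the thresholding idea, using a plain dictionary as the sparse store and Python integers as toy bit vectors. The function names, the fingerprints, and the threshold of 0.7 are assumptions for illustration; note that this naive version still compares every pair and only saves on storage:

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    common = bin(a & b).count("1")
    union = bin(a).count("1") + bin(b).count("1") - common
    return common / union if union else 0.0

def sparse_similarities(fps, threshold):
    """Return {(i, j): similarity} for i < j with similarity >= threshold."""
    sparse = {}
    for (i, a), (j, b) in combinations(enumerate(fps), 2):
        sim = tanimoto(a, b)
        if sim >= threshold:   # everything below the threshold is ignored
            sparse[(i, j)] = sim
    return sparse

fps = [0b11110000, 0b11100000, 0b00001111, 0b00011111]
sparse = sparse_similarities(fps, threshold=0.7)
print(sparse)   # {(0, 1): 0.75, (2, 3): 0.8}
```

Only 2 of the 6 pairs are kept here; a dictionary like this maps directly onto the (row, column, value) triples that sparse matrix formats usually take as input.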

Using my package, chemfp, from http://chemfp.com/ you can compute a sparse 
matrix for 1M x 1M fingerprints in about an hour using a laptop or desktop.

The question would then be how to adapt the sparse output format from chemfp 
to the sparse input format for your clustering method of choice.

Best regards,

Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-22 Thread Taka Seri
Dear Jing,

How about trying bayon?
https://code.google.com/p/bayon/
It is not part of RDKit, but I think the library can cluster molecules
using ECFP4.

Unfortunately, bayon's input file format is not a distance matrix, but the
format is easy to prepare.

Best regards.

Takayuki


On Sun, Aug 23, 2015 at 12:03, Jing Lu wrote:

> Currently, I prefer fingerprint based clustering, because it's hard to set
> the cutoff for scaffold based clustering. Does RDKit have scaffold based
> clustering?
>
> On Sat, Aug 22, 2015 at 10:56 PM,  wrote:
>
>> Hi, how about scaffold based clustering . You extract the scaffolds and
>> then cluster it and then put the respective scaffold compounds inside the
>> cluster .
>>
>> Sent from my iPhone
>>
>> > On Aug 22, 2015, at 8:43 PM, Jing Lu  wrote:
>> >
>> > Dear RDKit users,
>> >
>> > If I want to cluster more than 1M molecules by ECFP4. How could I do
>> it? If I calculate the distance between every pair of molecules, the size
>> of distance matrix will be too big. Does RDKit support any heuristic
>> clustering algorithm without calculating the distance matrix of the whole
>> library?
>> >
>> >
>> >
>> > Thanks,
>> > Jing
>> >
>>
>
>


Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-22 Thread Jing Lu
Currently, I prefer fingerprint based clustering, because it's hard to set
the cutoff for scaffold based clustering. Does RDKit have scaffold based
clustering?

On Sat, Aug 22, 2015 at 10:56 PM,  wrote:

> Hi, how about scaffold based clustering . You extract the scaffolds and
> then cluster it and then put the respective scaffold compounds inside the
> cluster .
>
> Sent from my iPhone
>
> > On Aug 22, 2015, at 8:43 PM, Jing Lu  wrote:
> >
> > Dear RDKit users,
> >
> > If I want to cluster more than 1M molecules by ECFP4. How could I do it?
> If I calculate the distance between every pair of molecules, the size of
> distance matrix will be too big. Does RDKit support any heuristic
> clustering algorithm without calculating the distance matrix of the whole
> library?
> >
> >
> >
> > Thanks,
> > Jing
> >
>


Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-22 Thread abhik1368
Hi, how about scaffold-based clustering? You extract the scaffolds, cluster 
them, and then put the compounds belonging to each scaffold into its cluster.

Sent from my iPhone

> On Aug 22, 2015, at 8:43 PM, Jing Lu  wrote:
> 
> Dear RDKit users,
> 
> If I want to cluster more than 1M molecules by ECFP4. How could I do it? If I 
> calculate the distance between every pair of molecules, the size of distance 
> matrix will be too big. Does RDKit support any heuristic clustering algorithm 
> without calculating the distance matrix of the whole library?
> 
> 
> 
> Thanks,
> Jing
