Re: [scikit-learn] clustering on big dataset

2018-02-07 Thread Manuel Castejón Limas
Hope this helps!

Manuel


@Article{Ciampi2008,

author="Ciampi, Antonio

and Lechevallier, Yves

and Limas, Manuel Castej{\'o}n

and Marcos, Ana Gonz{\'a}lez",

title="Hierarchical clustering of subpopulations with a dissimilarity based
on the likelihood ratio statistic: application to clustering massive data
sets",

journal="Pattern Analysis and Applications",

year="2008",

month="Jun",

day="01",

volume="11",

number="2",

pages="199--220",

abstract="The problem of clustering subpopulations on the basis of samples
is considered within a statistical framework: a distribution for the
variables is assumed for each subpopulation and the dissimilarity between
any two populations is defined as the likelihood ratio statistic which
compares the hypothesis that the two subpopulations differ in the parameter
of their distributions to the hypothesis that they do not. A general
algorithm for the construction of a hierarchical classification is
described which has the important property of not having inversions in the
dendrogram. The essential elements of the algorithm are specified for the
case of well-known distributions (normal, multinomial and Poisson) and an
outline of the general parametric case is also discussed. Several
applications are discussed, the main one being a novel approach to dealing
with massive data in the context of a two-step approach. After clustering
the data in a reasonable number of `bins' by a fast algorithm such as
k-Means, we apply a version of our algorithm to the resulting bins.
Multivariate normality for the means calculated on each bin is assumed:
this is justified by the central limit theorem and the assumption that each
bin contains a large number of units, an assumption generally justified
when dealing with truly massive data such as currently found in modern data
analysis. However, no assumption is made about the data generating
distribution.",

issn="1433-755X",

doi="10.1007/s10044-007-0088-4",

url="https://doi.org/10.1007/s10044-007-0088-4;

}





2018-01-04 12:55 GMT+01:00 Joel Nothman :

> Can you use nearest neighbors with a KD tree to build a distance matrix
> that is sparse, in that distances to all but the nearest neighbors of a
> point are (near-)infinite? Yes, this again has an additional parameter
> (neighborhood size), just as BIRCH has its threshold. I suspect you will
> not be able to improve on having another, approximating, parameter. You do
> not need to set n_clusters to a fixed value for BIRCH. You only need to
> provide another clusterer, which has its own parameters, although you
> should be able to experiment with different "global clusterers".
>
> On 4 January 2018 at 11:04, Shiheng Duan  wrote:
>
>> Yes, it is an efficient method, still, we need to specify the number of
>> clusters or the threshold. Is there another way to run hierarchy clustering
>> on the big dataset? The main problem is the distance matrix.
>> Thanks.
>>
>> On Tue, Jan 2, 2018 at 6:02 AM, Olivier Grisel 
>> wrote:
>>
>>> Have you had a look at BIRCH?
>>>
>>> http://scikit-learn.org/stable/modules/clustering.html#birch
>>>
>>> --
>>> Olivier
>>> ​
>>>
>>> ___
>>> scikit-learn mailing list
>>> scikit-learn@python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] clustering on big dataset

2018-01-04 Thread Joel Nothman
Yes, use an approximate nearest neighbors approach. None is included in
scikit-learn, but there are numerous implementations with Python interfaces.

On 5 January 2018 at 12:51, Shiheng Duan  wrote:

> Thanks, Joel,
> I am working on KD-tree to find the nearest neighbors. Basically, I find
> the nearest neighbors for each point and then merge a couple of points if
> they are both NN for each other. The problem is that after each iteration,
> we will have a new bunch of points, where new clusters are added. So the
> tree needs to be updated. Since I didn't find any dynamic way to update the
> tree, I just rebuild it after each iteration, costing lots of time. Any
> idea about it?
> Actually, it takes around 16 mins to build the tree in the first
> iteration, which is not slow I think. But it still runs slowly. I have a
> dataset of 12*872505 (features, samples). It takes several days to run the
> program. Is there any way to speed up the query process of NN? I doubt
> query may be too slow.
> Thanks for your time.
>
> On Thu, Jan 4, 2018 at 3:55 AM, Joel Nothman 
> wrote:
>
>> Can you use nearest neighbors with a KD tree to build a distance matrix
>> that is sparse, in that distances to all but the nearest neighbors of a
>> point are (near-)infinite? Yes, this again has an additional parameter
>> (neighborhood size), just as BIRCH has its threshold. I suspect you will
>> not be able to improve on having another, approximating, parameter. You do
>> not need to set n_clusters to a fixed value for BIRCH. You only need to
>> provide another clusterer, which has its own parameters, although you
>> should be able to experiment with different "global clusterers".
>>
>> On 4 January 2018 at 11:04, Shiheng Duan  wrote:
>>
>>> Yes, it is an efficient method, still, we need to specify the number of
>>> clusters or the threshold. Is there another way to run hierarchy clustering
>>> on the big dataset? The main problem is the distance matrix.
>>> Thanks.
>>>
>>> On Tue, Jan 2, 2018 at 6:02 AM, Olivier Grisel >> > wrote:
>>>
 Have you had a look at BIRCH?

 http://scikit-learn.org/stable/modules/clustering.html#birch

 --
 Olivier
 ​

 ___
 scikit-learn mailing list
 scikit-learn@python.org
 https://mail.python.org/mailman/listinfo/scikit-learn


>>>
>>> ___
>>> scikit-learn mailing list
>>> scikit-learn@python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] clustering on big dataset

2018-01-04 Thread Shiheng Duan
Thanks, Joel,
I am working on KD-tree to find the nearest neighbors. Basically, I find
the nearest neighbors for each point and then merge a couple of points if
they are both NN for each other. The problem is that after each iteration,
we will have a new bunch of points, where new clusters are added. So the
tree needs to be updated. Since I didn't find any dynamic way to update the
tree, I just rebuild it after each iteration, costing lots of time. Any
idea about it?
Actually, it takes around 16 mins to build the tree in the first iteration,
which is not slow I think. But it still runs slowly. I have a dataset of
12*872505 (features, samples). It takes several days to run the program. Is
there any way to speed up the query process of NN? I doubt query may be too
slow.
Thanks for your time.

On Thu, Jan 4, 2018 at 3:55 AM, Joel Nothman  wrote:

> Can you use nearest neighbors with a KD tree to build a distance matrix
> that is sparse, in that distances to all but the nearest neighbors of a
> point are (near-)infinite? Yes, this again has an additional parameter
> (neighborhood size), just as BIRCH has its threshold. I suspect you will
> not be able to improve on having another, approximating, parameter. You do
> not need to set n_clusters to a fixed value for BIRCH. You only need to
> provide another clusterer, which has its own parameters, although you
> should be able to experiment with different "global clusterers".
>
> On 4 January 2018 at 11:04, Shiheng Duan  wrote:
>
>> Yes, it is an efficient method, still, we need to specify the number of
>> clusters or the threshold. Is there another way to run hierarchy clustering
>> on the big dataset? The main problem is the distance matrix.
>> Thanks.
>>
>> On Tue, Jan 2, 2018 at 6:02 AM, Olivier Grisel 
>> wrote:
>>
>>> Have you had a look at BIRCH?
>>>
>>> http://scikit-learn.org/stable/modules/clustering.html#birch
>>>
>>> --
>>> Olivier
>>> ​
>>>
>>> ___
>>> scikit-learn mailing list
>>> scikit-learn@python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] clustering on big dataset

2018-01-03 Thread Shiheng Duan
Yes, it is an efficient method, still, we need to specify the number of
clusters or the threshold. Is there another way to run hierarchy clustering
on the big dataset? The main problem is the distance matrix.
Thanks.

On Tue, Jan 2, 2018 at 6:02 AM, Olivier Grisel 
wrote:

> Have you had a look at BIRCH?
>
> http://scikit-learn.org/stable/modules/clustering.html#birch
>
> --
> Olivier
> ​
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] clustering on big dataset

2018-01-02 Thread Olivier Grisel
Have you had a look at BIRCH?

http://scikit-learn.org/stable/modules/clustering.html#birch

-- 
Olivier
​
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] clustering on big dataset

2018-01-01 Thread Shiheng Duan
Hi all,
I wonder if there is any method to do exact clustering (hierarchy cluster)
on a huge dataset where it is impossible to use distance matrix. I am
considering KD-tree but every time it needs to rebuild it, consuming lots
time.
Any ideas?
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn