Re: [scikit-learn] clustering on big dataset
Hope this helps!
Manuel

@Article{Ciampi2008,
  author   = "Ciampi, Antonio and Lechevallier, Yves and Limas, Manuel Castej{\'o}n and Marcos, Ana Gonz{\'a}lez",
  title    = "Hierarchical clustering of subpopulations with a dissimilarity based on the likelihood ratio statistic: application to clustering massive data sets",
  journal  = "Pattern Analysis and Applications",
  year     = "2008",
  month    = "Jun",
  day      = "01",
  volume   = "11",
  number   = "2",
  pages    = "199--220",
  abstract = "The problem of clustering subpopulations on the basis of samples is considered within a statistical framework: a distribution for the variables is assumed for each subpopulation and the dissimilarity between any two populations is defined as the likelihood ratio statistic which compares the hypothesis that the two subpopulations differ in the parameter of their distributions to the hypothesis that they do not. A general algorithm for the construction of a hierarchical classification is described which has the important property of not having inversions in the dendrogram. The essential elements of the algorithm are specified for the case of well-known distributions (normal, multinomial and Poisson) and an outline of the general parametric case is also discussed. Several applications are discussed, the main one being a novel approach to dealing with massive data in the context of a two-step approach. After clustering the data in a reasonable number of `bins' by a fast algorithm such as k-Means, we apply a version of our algorithm to the resulting bins. Multivariate normality for the means calculated on each bin is assumed: this is justified by the central limit theorem and the assumption that each bin contains a large number of units, an assumption generally justified when dealing with truly massive data such as currently found in modern data analysis. However, no assumption is made about the data generating distribution.",
  issn     = "1433-755X",
  doi      = "10.1007/s10044-007-0088-4",
  url      = "https://doi.org/10.1007/s10044-007-0088-4"
}

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
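The two-step approach described in the abstract (bin the data with a fast algorithm, then cluster the bins hierarchically) can be sketched with scikit-learn. This is only an illustration of the shape of the idea: MiniBatchKMeans stands in for plain k-means, Ward linkage on the bin centroids stands in for the paper's likelihood-ratio dissimilarity, and all sizes are made up.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans, AgglomerativeClustering

rng = np.random.RandomState(0)
X = rng.rand(50_000, 12)  # stand-in for a massive (samples, features) dataset

# Step 1: compress the data into a modest number of "bins" with a fast
# algorithm (MiniBatchKMeans here, standing in for plain k-means).
binner = MiniBatchKMeans(n_clusters=200, random_state=0, n_init=3).fit(X)

# Step 2: hierarchical clustering on the bin centroids only, so the full
# n x n distance matrix over the raw points is never needed.
agg = AgglomerativeClustering(n_clusters=10).fit(binner.cluster_centers_)

# Map every original point to the final cluster of its bin.
labels = agg.labels_[binner.labels_]
```

The key property is that step 2 operates on 200 centroids rather than 50,000 points, so its quadratic cost is negligible.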
Re: [scikit-learn] clustering on big dataset
Yes, use an approximate nearest neighbors approach. None is included in scikit-learn, but there are numerous implementations with Python interfaces.
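Joel's earlier sparse-distance-matrix suggestion can be sketched with scikit-learn's kneighbors_graph: distances to all but each point's k nearest neighbors are simply absent from the sparse matrix, and AgglomerativeClustering accepts it as a connectivity constraint. The values n_neighbors=10 and n_clusters=20 below are illustrative, not recommendations.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import AgglomerativeClustering

rng = np.random.RandomState(0)
X = rng.rand(1_000, 12)  # (samples, features); sizes are illustrative

# Sparse k-NN graph: entries exist only for each point's 10 nearest
# neighbors; all other pairwise distances are treated as unreachable.
connectivity = kneighbors_graph(X, n_neighbors=10, mode="distance",
                                include_self=False)

# Hierarchical clustering restricted to merges along the k-NN graph, so the
# full n x n distance matrix is never materialised.
model = AgglomerativeClustering(n_clusters=20, connectivity=connectivity)
labels = model.fit_predict(X)
```

As Joel notes, the neighborhood size is an extra approximating parameter; scikit-learn will warn and repair the graph if it is not fully connected.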
Re: [scikit-learn] clustering on big dataset
Thanks, Joel,
I am working with a KD-tree to find nearest neighbors. Basically, I find the nearest neighbor of each point and merge a pair of points when each is the other's nearest neighbor. The problem is that after each iteration we have a new set of points, since new clusters have been added, so the tree needs to be updated. I didn't find any way to update the tree dynamically, so I rebuild it after each iteration, which costs a lot of time. Any ideas?
It takes around 16 minutes to build the tree in the first iteration, which I don't think is slow, but the program as a whole still runs slowly: my dataset is 12 x 872505 (features x samples) and it takes several days to run. Is there any way to speed up the NN queries? I suspect the queries may be too slow.
Thanks for your time.

On Thu, Jan 4, 2018 at 3:55 AM, Joel Nothman wrote:
> Can you use nearest neighbors with a KD tree to build a distance matrix
> that is sparse, in that distances to all but the nearest neighbors of a
> point are (near-)infinite? Yes, this again has an additional parameter
> (neighborhood size), just as BIRCH has its threshold. I suspect you will
> not be able to improve on having another, approximating, parameter. You do
> not need to set n_clusters to a fixed value for BIRCH. You only need to
> provide another clusterer, which has its own parameters, although you
> should be able to experiment with different "global clusterers".
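The mutual-nearest-neighbor step described above can be sketched with scikit-learn's KDTree; the data here is random and far smaller than the 12 x 872505 dataset.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.rand(5_000, 12)  # (samples, features); illustrative size

tree = KDTree(X)
# k=2 because the nearest result for each query point is the point itself.
dist, ind = tree.query(X, k=2)
nn = ind[:, 1]  # index of each point's nearest (other) neighbor

# Merge candidates: pairs (i, j) where i and j are each other's nearest
# neighbor; the i < j condition avoids listing each pair twice.
mutual = [(i, int(j)) for i, j in enumerate(nn) if nn[j] == i and i < j]
```

One way to avoid a full rebuild every iteration is to keep the tree fixed and re-query only the points whose nearest neighbor was consumed by a merge, or to switch to an approximate index; that is a design suggestion, not something scikit-learn provides out of the box.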
Re: [scikit-learn] clustering on big dataset
Yes, it is an efficient method, but we still need to specify the number of clusters or the threshold. Is there another way to run hierarchical clustering on a big dataset? The main problem is the distance matrix.
Thanks.
Re: [scikit-learn] clustering on big dataset
Have you had a look at BIRCH?

http://scikit-learn.org/stable/modules/clustering.html#birch

--
Olivier
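BIRCH compresses the data into a tree of subclusters controlled by its threshold, so the full distance matrix is never built; n_clusters can be None, an integer, or another clusterer that is run on the subcluster centroids as a "global clusterer". The threshold and cluster counts below are illustrative only.

```python
import numpy as np
from sklearn.cluster import Birch, AgglomerativeClustering

rng = np.random.RandomState(0)
X = rng.rand(2_000, 12)  # (samples, features); illustrative size

# n_clusters=None: every leaf subcluster is returned as its own cluster.
birch = Birch(threshold=0.5, n_clusters=None).fit(X)

# Alternatively, pass a "global clusterer" that is applied to the compact
# set of subcluster centroids instead of to the raw points.
birch_global = Birch(threshold=0.5,
                     n_clusters=AgglomerativeClustering(n_clusters=10))
labels = birch_global.fit_predict(X)
```

This is the point made later in the thread: the hierarchical step only ever sees the subcluster centroids, so its cost is decoupled from the number of raw samples.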
[scikit-learn] clustering on big dataset
Hi all,
I wonder whether there is any method for exact hierarchical clustering on a huge dataset for which it is impossible to build the full distance matrix. I am considering a KD-tree, but it needs to be rebuilt every time, which consumes a lot of time. Any ideas?