I am traveling and it is difficult to get a real internet connection.
Here is an answer to one of your questions.

For very high-dimensional data, some kind of dimensionality reduction is usually important. The streaming k-means code does this by using a random projection to approximate the nearest-centroid search.

Note that the output of the streaming step is *not* a set of initial centroids. Instead, it is a large number of centroids that are clustered as a surrogate for the original data. These centroids are far less numerous than the original data, so the final ball k-means can run in memory. This is very different from the canopy approach.

There is a known issue with the map-reduce version of the streaming k-means program that causes the number of centroids output by the parallel part of the algorithm to be too large.

Sent from my iPhone

> On Jul 28, 2014, at 3:08, Bojan Kostić <[email protected]> wrote:
>
> Also as i see this stream kmeans is for large sets of data. Does this large
> means large number of points and not dimmensions? And what to do when data
> have large dimensions? Like more then 1000000 dimensions.
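
To make the random-projection idea in the reply above more concrete, here is a rough sketch in plain Python. It is not the Mahout implementation. The function names, the number of projections, and the toy data are all assumptions chosen for illustration. The idea it shows: rank centroids by distance in a small projected space, then check only a short list of candidates with the full-dimensional distance.

```python
import random


def random_directions(dim, n_proj, seed=0):
    """Draw n_proj random Gaussian direction vectors, each of length dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_proj)]


def project(vec, directions):
    """Map a dim-dimensional vector to n_proj dot products."""
    return [sum(v * d for v, d in zip(vec, direction)) for direction in directions]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def approx_nearest(point, centroids, projected, directions, n_candidates=3):
    """Return the index of an approximately nearest centroid.

    Centroids are ranked in the cheap projected space; only the best
    n_candidates are then compared using the full-dimensional distance.
    """
    p = project(point, directions)
    ranked = sorted(range(len(centroids)), key=lambda i: sq_dist(p, projected[i]))
    shortlist = ranked[:n_candidates]
    return min(shortlist, key=lambda i: sq_dist(point, centroids[i]))


# Hypothetical toy data: 20 centroids in 200 dimensions.
rng = random.Random(42)
dim = 200
centroids = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(20)]

# Project every centroid once, up front, onto 8 random directions.
directions = random_directions(dim, n_proj=8)
projected = [project(c, directions) for c in centroids]

# A query point that is a slightly perturbed copy of centroid 7.
query = [x + rng.gauss(0.0, 0.01) for x in centroids[7]]
nearest = approx_nearest(query, centroids, projected, directions)
```

The ranking step compares 8-dimensional vectors instead of full-length ones, which is where the saving would come from when the data has very many dimensions. The shortlist size trades accuracy for speed: a larger n_candidates means more full-distance checks but less chance of missing the true nearest centroid.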
