I am traveling and it is difficult to get a real internet connection. 

Here is an answer to one of your questions.

For very high dimensional data, some kind of dimensionality reduction is usually
important. The streaming k-means code does this by approximating the
nearest-centroid search using a random projection.
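Roughly, the idea is something like the sketch below. This is just plain Java
with made-up names (ProjectionSearchSketch, candidatesPerSide), not the actual
Mahout searcher, and it uses a single random direction where the real code
keeps several; it is only meant to show why projecting makes the nearest-centroid
lookup cheap in high dimensions.

import java.util.*;

/**
 * Sketch of approximate nearest-centroid search via random projection.
 * Centroids are indexed by their dot product with one random Gaussian
 * direction; a query only gets compared against the few centroids whose
 * projected values bracket its own.
 */
public class ProjectionSearchSketch {
  private final double[] direction;                              // random Gaussian direction
  private final TreeMap<Double, double[]> index = new TreeMap<>(); // projected value -> centroid

  public ProjectionSearchSketch(int dimension, Random rng) {
    direction = new double[dimension];
    for (int i = 0; i < dimension; i++) {
      direction[i] = rng.nextGaussian();
    }
  }

  public void add(double[] centroid) {
    index.put(project(centroid), centroid);                      // key ties are unlikely with doubles
  }

  /** Approximate nearest centroid: exact distances, but only over a small candidate pool. */
  public double[] searchNearest(double[] query, int candidatesPerSide) {
    double q = project(query);
    List<double[]> pool = new ArrayList<>();
    int taken = 0;
    for (double[] c : index.headMap(q, true).descendingMap().values()) {
      if (taken++ >= candidatesPerSide) break;                   // a few candidates below the query
      pool.add(c);
    }
    taken = 0;
    for (double[] c : index.tailMap(q, false).values()) {
      if (taken++ >= candidatesPerSide) break;                   // and a few above it
      pool.add(c);
    }
    double[] best = null;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (double[] c : pool) {
      double d = squaredDistance(query, c);
      if (d < bestDistance) {
        bestDistance = d;
        best = c;
      }
    }
    return best;                                                 // null only if the index is empty
  }

  private double project(double[] v) {
    double sum = 0;
    for (int i = 0; i < v.length; i++) {
      sum += v[i] * direction[i];
    }
    return sum;
  }

  private static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}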

Note that the output of the streaming step is *not* a set of initial centroids.
Instead, it is a large number of centroids which are then clustered as a surrogate
for the original data. These centroids are much less numerous than the
original data points, so the final ball k-means can run in memory. This is very
different from the canopy approach.
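To make the two phases concrete, here is a rough sketch of the streaming pass.
Again, this is plain Java with invented names (buildSketch, maxSketchSize,
initialCutoff), not the Mahout StreamingKMeans code: the real code uses the
projection search rather than a linear scan, and it re-collapses the sketch
when it grows too large instead of just loosening the distance cutoff.

import java.util.*;

/**
 * Sketch of the streaming phase: collapse the data into a bounded number of
 * weighted "sketch" centroids, which are what the final in-memory ball
 * k-means actually clusters.
 */
public class StreamingSketch {
  static class WeightedPoint {
    double[] center;
    double weight;
    WeightedPoint(double[] c) { center = c.clone(); weight = 1; }
  }

  static List<WeightedPoint> buildSketch(Iterable<double[]> points,
                                         int maxSketchSize,
                                         double initialCutoff) {
    List<WeightedPoint> sketch = new ArrayList<>();
    double cutoff = initialCutoff;
    for (double[] p : points) {
      WeightedPoint nearest = null;
      double best = Double.POSITIVE_INFINITY;
      for (WeightedPoint w : sketch) {                 // the real code uses projection search here
        double d = distance(p, w.center);
        if (d < best) { best = d; nearest = w; }
      }
      if (nearest == null || best > cutoff) {
        sketch.add(new WeightedPoint(p));              // far away: start a new sketch centroid
      } else {
        // close enough: fold the point into the existing centroid as a weighted mean
        for (int i = 0; i < p.length; i++) {
          nearest.center[i] = (nearest.center[i] * nearest.weight + p[i]) / (nearest.weight + 1);
        }
        nearest.weight += 1;
      }
      if (sketch.size() > maxSketchSize) {
        cutoff *= 1.5;                                 // loosen the cutoff so future points merge more
      }
    }
    return sketch;                                     // thousands of weighted centroids, not millions of points
  }

  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}

Only this sketch has to fit in memory for the final ball k-means, not the
original data.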

There is a known issue with the map-reduce version of the streaming k-means 
program that causes the number of centroids output by the parallel part of the 
algorithm to be too large. 



Sent from my iPhone

> On Jul 28, 2014, at 3:08, Bojan Kostić <[email protected]> wrote:
> 
> Also, as I see it, this streaming k-means is for large sets of data. Does this
> "large" mean a large number of points and not dimensions? And what to do when
> the data have large dimensions? Like more than 1000000 dimensions.
