Hi Ted,

Thanks for the response. I have read the document. Even though I am rusty in math and English is not my primary language, I think I understood the principles from the docs.

I overlooked this part of the Mahout docs: "The seeding stage is an initial guess of where the centroids should be. The initial guess is improved using the ball k-means stage." Now I see that the streaming stage sets the initial centroids and the ball k-means stage improves them.

I was expecting clusters like the ones k-means produces, but I got:

Key class: class org.apache.hadoop.io.IntWritable
Value Class: class org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable

But my question still stands: how do I use this to cluster data? I was thinking of hacking k-means to use the results from streaming k-means as initial centroids and then cluster the data.

Also, as I understand it, streaming k-means is meant for large data sets. Does "large" here mean a large number of points rather than a large number of dimensions? And what should I do when the data has a very large number of dimensions, for example more than 1000000?

Best regards.

On Thu, Jul 24, 2014 at 12:37 AM, Ted Dunning <[email protected]> wrote:

> On Wed, Jul 23, 2014 at 2:10 AM, Bojan Kostić <[email protected]>
> wrote:
>
> > <clustering questions>
> >
> > What am i missing?
> >
>
> Did you read the referenced papers?
>
> Notably:
>
> http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets.pdf
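
On the question of how to turn a set of centroids into a clustering of the data: one common approach is to assign each point to its nearest centroid. The sketch below illustrates only that idea in plain Python. It does not use Mahout's API; the function names, the squared Euclidean distance, and the sample data are all assumptions made for illustration.

```python
def squared_distance(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_nearest(points, centroids):
    # For each point, return the index of the closest centroid.
    # Hypothetical helper, not part of Mahout.
    labels = []
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        labels.append(nearest)
    return labels


# Made-up centroids standing in for the output of a centroid-finding step.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 0.5), (9.0, 11.0), (-0.5, 0.2)]
labels = assign_to_nearest(points, centroids)
```

Each entry of `labels` is the index of the centroid that the corresponding point falls under, so the centroids alone are enough to label the data in one extra pass.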
