Re: Streaming kmeans question

Bojan Kostić Mon, 28 Jul 2014 02:09:07 -0700

Hi Ted,

Thanks for the response.
I have read the document.
Even though I am rusty in math and English is not my primary language, I
think I understood the principles from the docs.

I overlooked this part from the Mahout docs: "The seeding stage is an
initial guess of where the centroids should be. The initial guess is
improved using the ball k-means stage."
Now I see that the streaming step sets the initial centroids and the ball
k-means step improves them.
I was expecting clusters like in kmeans, but I got:
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable

But my question still stands: how do I use this to cluster data? I was
thinking of hacking kmeans to use the results from streaming kmeans as
initial centroids and then cluster the data.
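
To make the idea concrete, here is a minimal sketch of what I mean. It assumes the centroids have already been read out of the CentroidWritable output into plain lists of numbers. The function names are hypothetical and this is not Mahout's API, just plain Python showing the assignment step and one ordinary k-means (Lloyd) update seeded with those centroids:

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_points(points, centroids):
    """Label each point with the index of its nearest centroid."""
    labels = []
    for p in points:
        distances = [squared_distance(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels


def update_centroids(points, labels, centroids):
    """One Lloyd step: move each centroid to the mean of its assigned points.

    A centroid with no assigned points is left where it was.
    """
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for p, k in zip(points, labels):
        counts[k] += 1
        for j in range(dims):
            sums[k][j] += p[j]
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else list(centroids[k])
        for k in range(len(centroids))
    ]


# Hypothetical seed centroids (stand-ins for the streaming kmeans output).
seed_centroids = [[0.0, 0.0], [10.0, 10.0]]
data = [[1.0, 1.0], [9.0, 9.0], [0.0, 2.0], [11.0, 10.0]]

labels = assign_points(data, seed_centroids)
refined = update_centroids(data, labels, seed_centroids)
```

If only the cluster membership is needed, the assignment step alone may be enough. The update step is what a regular kmeans run would repeat until the centroids stop moving.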

Also, as I understand it, streaming kmeans is meant for large data sets.
Does "large" here mean a large number of points rather than a large number
of dimensions? And what should I do when the data has very many dimensions,
say more than 1,000,000?

Best regards.

On Thu, Jul 24, 2014 at 12:37 AM, Ted Dunning <[email protected]> wrote:

> On Wed, Jul 23, 2014 at 2:10 AM, Bojan Kostić <[email protected]>
> wrote:
>
> > <clustering questions>
> >
> > What am I missing?
> >
>
>
> Did you read the referenced papers?
>
> Notably:
>
>
> http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets.pdf
>
