Re: problems with running K-means on hadoop's pseudo-distributed mode

Ted Dunning Mon, 09 Jun 2014 12:29:15 -0700

Heh?

Can you say more about what you are trying to do and how you are doing it?


Also, can you say how this matters to the community?

And how it relates to the recent clustering work done in Mahout?




On Mon, Jun 9, 2014 at 9:02 AM, Ajay Sharma <[email protected]> wrote:

> K-means Clustering
> K-means: a widely used clustering technique. Initialization: blind random
> selection from the input data.
> Drawback: very sensitive to the choice of initial cluster centers (seeds).
> A local optimum can be arbitrarily bad with respect to the objective
> function, compared to the globally optimal clustering.
>
> Idea (k-means++): spread the k initial cluster centers away from each other.
> O(log k)-competitive with the optimal clustering; substantial convergence
> time speedups (empirical).
>
> C <- {a point sampled uniformly at random from X}
> while |C| < k do
>     sample x ∈ X with probability proportional to D(x)^2
>     C <- C ∪ {x}
> end while
>
> c ∈ C: cluster center
> x ∈ X: data point
> D(x): distance between x and the nearest center in C that has already been
> chosen
>
> Test dataset
> 200 clustering runs, each with and without k-means++ initialization
> Measure RSS (intra-class variance)
>
> K-means: optimal clustering 115 times (57.5%)
>
> Implementation test dataset: 4 Square (n=16)
>
> Expected: 4 nice clusters
>
> Evaluation on test dataset
> • 200 clustering runs, each with and without k-means++ initialization
> • Measure RSS (intra-class variance)
> • K-means: optimal clustering 115 times (57.5%)
> • K-means++: optimal clustering 182 times (91%)
>
> Comparison of the frequency distribution of RSS values between k-means and
> k-means++ on the evaluation dataset (n=200)
>
> Comparison of the frequency distribution of RSS values between k-means and
> k-means++ on the UCI real-world dataset (n=500)
>
> On Mon, Jun 9, 2014 at 10:50 AM, sumit sharma <[email protected]> wrote:
>
> > Naïve Bayes can be used for text clustering effectively in Mahout.
> >
> >
> > On Mon, Jun 9, 2014 at 7:07 PM, Eeti Jain <[email protected]> wrote:
> >
> > >
> > > Sir, I have been working on the Hadoop/Mahout platform, performing
> > > clustering on Twitter data for my thesis work. I would like to know
> > > whether Mahout can handle text documents in other languages. Could you
> > > please help me, sir?
> > >
> > >
> > >
> > >
> > >
> >
> >
> > --
> >
> > Best Regards:
> > Sumit Sharma
> >
>
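
The seeding loop quoted above (sample the first center uniformly, then sample each further center with probability proportional to D(x)^2) can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not Mahout's implementation: the names kmeans_pp_seeds and rss are hypothetical, and the toy dataset in the usage note is made up, since the thread does not give the coordinates of the 16-point test dataset.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, rng=None):
    """Choose k initial centers from `points` using D(x)^2 weighting.

    Hypothetical helper for illustration; `rng` is an optional
    random.Random instance so that runs can be made reproducible.
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # First center: sampled uniformly at random from X.
    centers = [rng.choice(points)]
    # d2[i] holds D(x)^2 for points[i]: squared distance to nearest center.
    d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            # Sample x with probability proportional to D(x)^2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        # Update D(x)^2 now that a new center exists.
        d2 = [min(old, squared_distance(p, points[idx]))
              for old, p in zip(d2, points)]
    return centers


def rss(points, centers):
    """Residual sum of squares: each point's squared distance to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)
```

A possible way to try it, using an assumed toy layout of four small squares with four points each:

```python
corners = [(0, 0), (10, 0), (0, 10), (10, 10)]
offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
points = [(cx + dx, cy + dy) for cx, cy in corners for dx, dy in offsets]

seeds = kmeans_pp_seeds(points, k=4, rng=random.Random(0))
print(seeds, rss(points, seeds))
```

The seeds would then be handed to an ordinary k-means iteration; comparing the final RSS over many runs with and without this seeding is the kind of evaluation the quoted message describes.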
