It looks like the default algorithm used by R's kmeans function is Hartigan-Wong, whereas Spark seems to use Lloyd's algorithm. Can you rerun your R kmeans code with algorithm = "Lloyd" and see whether the results match?
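For what it's worth, here is a toy sketch of why two k-means runs can settle on different centers even with plenty of iterations: plain Lloyd iterations only reach a local optimum, so the outcome depends on where the centers start. This is not Spark's or R's actual implementation; the class name LloydSketch, its methods, and the four-point data set are all made up for illustration.

```java
import java.util.Arrays;

// Hypothetical toy class, not part of Spark or R.
class LloydSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double sqDist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Index of the center closest to point p (ties go to the lower index).
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (sqDist(p, centers[c]) < sqDist(p, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Plain Lloyd iterations: assign every point to its nearest center, then
    // move each center to the mean of its points, until no center moves.
    static double[][] lloyd(double[][] points, double[][] init, int maxIter) {
        int k = init.length;
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = init[c].clone();
        }
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            boolean moved = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // leave the center of an empty cluster where it is
                }
                for (int d = 0; d < dim; d++) {
                    double mean = sums[c][d] / counts[c];
                    if (mean != centers[c][d]) {
                        moved = true;
                    }
                    centers[c][d] = mean;
                }
            }
            if (!moved) {
                break;
            }
        }
        return centers;
    }

    // Within-cluster sum of squared distances for a set of centers.
    static double sse(double[][] points, double[][] centers) {
        double total = 0.0;
        for (double[] p : points) {
            total += sqDist(p, centers[nearest(p, centers)]);
        }
        return total;
    }

    public static void main(String[] args) {
        // Corners of a 4 x 1 rectangle, split into k = 2 clusters.
        double[][] points = {{0, 0}, {0, 1}, {4, 0}, {4, 1}};
        double[][] a = lloyd(points, new double[][] {{0, 0.5}, {4, 0.5}}, 100);
        double[][] b = lloyd(points, new double[][] {{2, 0}, {2, 1}}, 100);
        // prints [[0.0, 0.5], [4.0, 0.5]] SSE=1.0
        System.out.println(Arrays.deepToString(a) + " SSE=" + sse(points, a));
        // prints [[2.0, 0.0], [2.0, 1.0]] SSE=16.0
        System.out.println(Arrays.deepToString(b) + " SSE=" + sse(points, b));
    }
}
```

Both starts are already stable under Lloyd's update, yet the second partition has 16 times the within-cluster sum of squares of the first. Hartigan-Wong additionally considers moving individual points between clusters, so it can end up at yet another solution on the same data, which is why matching the algorithm on both sides seems like the first thing to rule out.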
On Tue, Jan 3, 2017 at 12:18 AM, Saroj C <saro...@tcs.com> wrote:

> Thanks Satya.
>
> I tried setting the initSteps as 25 and the maxIteration as 500, both in
> R and Spark. The results provided below were from that settings.
>
> Also, in Spark and R the center remains almost the same, but they are
> different from each other.
>
> Thanks & Regards
> Saroj
>
> From: Satya Varaprasad Allumallu <alluma...@gmail.com>
> To: Saroj C <saro...@tcs.com>
> Cc: User <user@spark.apache.org>
> Date: 01/02/2017 08:53 PM
> Subject: Re: Difference in R and Spark Output
> ------------------------------
>
> Can you run Spark Kmeans algorithm multiple times and see if the centers
> remain stable? I am guessing it is related to random initialization of
> centers.
>
> On Mon, Jan 2, 2017 at 1:34 AM, Saroj C <saro...@tcs.com> wrote:
> Dear Felix,
> Thanks. Please find the differences
>
> Cluster   Spark - Size   R - Size
> 0         69             114
> 1         79             141
> 2         77             93
> 3         90             44
> 4         130            53
>
> Spark - Centers
>
> 0.807554406  0.123759  -0.58642  -0.17803  0.624278  -0.06752  0.033517  -0.01504  -0.02794  0.016699  0.20841  -0.00149  -0.05598  0.039746  0.030756  -0.19788  -0.07906  -0.14881  0.0056  0.01479  0.066883  0.002491
> -0.428583581  -0.81975  0.347356  -0.18664  0.047582  0.058692  -0.0721  -0.13873  -0.08666  0.085334  0.054398  -0.0228  0.008369  0.073103  0.022246  -0.15439  -0.06016  -0.15073  -0.03734  0.004299  0.089258  -0.00694
> 0.692744675  0.148123  0.087253  0.851781  -0.2179  0.003407  -0.12357  -0.01795  0.016427  0.088004  0.021502  -0.04616  -0.00847  0.023397  0.057656  -0.12036  -0.03947  -0.13338  -0.02975  0.012217  0.090547  -0.00232
> -0.677692276  0.581091  0.446125  -0.13087  0.037225  0.018936  0.055286  0.01146  -0.08648  0.053719  0.072753  -0.00873  -0.04448  0.042067  0.089221  -0.1977  -0.07368  -0.14674  -0.00641  0.020815  0.058425  0.016745
> 1.03518389  0.228964  0.539982  -0.3581  -0.13488  -0.00525  -0.1267  -0.04439  -0.01923  0.111272  -0.05181  -0.05508  -0.04143  0.046479  0.059224  -0.16148  -0.07541  -0.12046  -0.03535  0.003049  0.070862  0.010083
>
> R - Centers
>
> 0.7710882  0.86271  0.249609  0.074961  0.251188  -0.05293  -0.11106  -0.08063  0.01516  0.054043  0.056937  -0.0287  -0.03291  0.056607  0.045214  -0.15237  -0.05442  -0.14038  -0.02326  0.013882  0.078523  -0.0087
> -0.644077  0.022256  0.368266  -0.06912  0.123979  0.009181  -0.04506  -0.04179  -0.0255  0.041568  0.04081  -0.02936  -0.04849  0.049712  0.062894  -0.16736  -0.06679  -0.12705  -0.007  0.018079  0.062337  0.00349
> 0.9772678  -0.57499  0.523792  -0.27319  0.163677  0.053579  -0.07616  0.074556  0.00662  0.087303  0.088835  -0.01923  -0.04938  0.07299  0.059872  -0.19137  -0.04737  -0.1536  0.002926  0.049441  0.079147  0.02771
> 0.5172924  0.167666  -0.16523  -0.82951  -0.77577  -0.00981  0.018531  -0.09629  -0.1654  0.273644  -0.05433  -0.03593  0.115834  0.021465  -0.00981  -0.15112  -0.16178  -0.04783  -0.19962  -0.12418  0.07286  0.03266
> 0.717927  -0.34229  -0.33544  0.817617  -0.21383  0.02735  0.01675  -0.10814  -0.1747  0.033743  0.038308  -0.0495  -0.05961  -0.01977  0.092247  -0.16017  -0.04787  -0.20766  0.040038  0.024614  0.090587  -0.0236
>
> Please let me know, if any additional info will help to find these
> anomalies.
>
> Thanks & Regards
> Saroj
>
> From: Felix Cheung <felixcheun...@hotmail.com>
> To: User <user@spark.apache.org>, Saroj C <saro...@tcs.com>
> Date: 12/31/2016 10:36 AM
> Subject: Re: Difference in R and Spark Output
> ------------------------------
>
> Could you elaborate more on the huge difference you are seeing?
>
> ------------------------------
> From: Saroj C <saro...@tcs.com>
> Sent: Friday, December 30, 2016 5:12:04 AM
> To: User
> Subject: Difference in R and Spark Output
>
> Dear All,
> For the attached input file, there is a huge difference between the
> Clusters in R and Spark(ML). Any idea, what could be the difference ?
>
> Note we wanted to create Five(5) clusters.
>
> Please find the snippets in Spark and R
>
> Spark
>
>     // Load the Data file
>
>     // Create K means Cluster
>     KMeans kmeans = new KMeans().setK(5).setMaxIter(500)
>         .setFeaturesCol("features")
>         .setPredictionCol("prediction");
>
> In R
>
>     # Load the Data File into df
>
>     # Create the K Means Cluster
>     model <- kmeans(df, 5)
>
> Thanks & Regards
> Saroj