Re: Difference in R and Spark Output

Satya Varaprasad Allumallu Wed, 04 Jan 2017 12:16:55 -0800

It looks like the default algorithm used by R's kmeans function is
Hartigan-Wong, whereas Spark seems to use Lloyd's algorithm.
Can you rerun your R kmeans code with algorithm = "Lloyd" and see if the
results match?
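
One caveat when checking whether the results match: cluster indices are
arbitrary labels, so cluster 0 in Spark need not correspond to the first
cluster in R. Below is a minimal sketch in plain Java, assuming you have
copied both sets of centers into arrays, that pairs each center from one run
with its nearest unused center from the other before comparing. CenterMatcher
and its methods are hypothetical names, not part of Spark or R, and the 2-D
numbers are made up for illustration.

```java
// Hypothetical helper (not part of Spark or R): pairs each center from one
// run with the closest unused center from another run.
final class CenterMatcher {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double d = p[i] - q[i];
            sum += d * d;
        }
        return sum;
    }

    // Greedy matching: for each row of a (in order), pick the nearest row of b
    // that has not been taken yet. Returns match[i] = index in b paired with a[i].
    // Assumes a and b hold the same number of centers. Greedy matching is not
    // guaranteed to minimise the total distance; it is a rough comparison aid.
    static int[] match(double[][] a, double[][] b) {
        int k = a.length;
        int[] match = new int[k];
        boolean[] taken = new boolean[k];
        for (int i = 0; i < k; i++) {
            int best = -1;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int j = 0; j < k; j++) {
                if (taken[j]) {
                    continue;
                }
                double d = squaredDistance(a[i], b[j]);
                if (d < bestDist) {
                    bestDist = d;
                    best = j;
                }
            }
            match[i] = best;
            taken[best] = true;
        }
        return match;
    }

    public static void main(String[] args) {
        // Made-up 2-D centers for illustration only (not the data from this thread).
        double[][] sparkCenters = {{0.0, 0.0}, {5.0, 5.0}, {-3.0, 2.0}};
        double[][] rCenters = {{5.1, 4.9}, {-2.9, 2.2}, {0.2, -0.1}};
        int[] m = match(sparkCenters, rCenters);
        for (int i = 0; i < m.length; i++) {
            System.out.printf("Spark center %d <-> R center %d (squared distance %.4f)%n",
                    i, m[i], squaredDistance(sparkCenters[i], rCenters[m[i]]));
        }
    }
}
```

If the paired centers are still far apart after matching, the two runs
converged to different solutions rather than to the same clusters under
different labels.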



On Tue, Jan 3, 2017 at 12:18 AM, Saroj C <saro...@tcs.com> wrote:

> Thanks Satya.
>
> I tried setting initSteps to 25 and the maximum number of iterations to
> 500 in both R and Spark. The results below are from those settings.
>
> Also, across repeated runs the centers remain almost the same within Spark
> and within R, but the Spark and R centers differ from each other.
>
>
> Thanks & Regards
> Saroj
>
>
>
>
> From:        Satya Varaprasad Allumallu <alluma...@gmail.com>
> To:        Saroj C <saro...@tcs.com>
> Cc:        User <user@spark.apache.org>
> Date:        01/02/2017 08:53 PM
> Subject:        Re: Difference in R and Spark Output
> ------------------------------
>
>
>
> Can you run the Spark KMeans algorithm multiple times and see if the centers
> remain stable? I am guessing it is related to the random initialization of
> the centers.
>
> On Mon, Jan 2, 2017 at 1:34 AM, Saroj C <saro...@tcs.com> wrote:
> Dear Felix,
>  Thanks. Please find the differences
> Cluster   Spark - Size   R - Size
>
> 0         69             114
> 1         79             141
> 2         77             93
> 3         90             44
> 4         130            53
>
>
>
> Spark - Centers (one center per block, 22 values each)
>
> 0.807554406  0.123759  -0.58642  -0.17803  0.624278  -0.06752
> 0.033517  -0.01504  -0.02794  0.016699  0.20841  -0.00149
> -0.05598  0.039746  0.030756  -0.19788  -0.07906  -0.14881
> 0.0056  0.01479  0.066883  0.002491
>
> -0.428583581  -0.81975  0.347356  -0.18664  0.047582  0.058692
> -0.0721  -0.13873  -0.08666  0.085334  0.054398  -0.0228
> 0.008369  0.073103  0.022246  -0.15439  -0.06016  -0.15073
> -0.03734  0.004299  0.089258  -0.00694
>
> 0.692744675  0.148123  0.087253  0.851781  -0.2179  0.003407
> -0.12357  -0.01795  0.016427  0.088004  0.021502  -0.04616
> -0.00847  0.023397  0.057656  -0.12036  -0.03947  -0.13338
> -0.02975  0.012217  0.090547  -0.00232
>
> -0.677692276  0.581091  0.446125  -0.13087  0.037225  0.018936
> 0.055286  0.01146  -0.08648  0.053719  0.072753  -0.00873
> -0.04448  0.042067  0.089221  -0.1977  -0.07368  -0.14674
> -0.00641  0.020815  0.058425  0.016745
>
> 1.03518389  0.228964  0.539982  -0.3581  -0.13488  -0.00525
> -0.1267  -0.04439  -0.01923  0.111272  -0.05181  -0.05508
> -0.04143  0.046479  0.059224  -0.16148  -0.07541  -0.12046
> -0.03535  0.003049  0.070862  0.010083
>
> R - Centers (one center per block, 22 values each)
>
> 0.7710882  0.86271  0.249609  0.074961  0.251188  -0.05293
> -0.11106  -0.08063  0.01516  0.054043  0.056937  -0.0287
> -0.03291  0.056607  0.045214  -0.15237  -0.05442  -0.14038
> -0.02326  0.013882  0.078523  -0.0087
>
> -0.644077  0.022256  0.368266  -0.06912  0.123979  0.009181
> -0.04506  -0.04179  -0.0255  0.041568  0.04081  -0.02936
> -0.04849  0.049712  0.062894  -0.16736  -0.06679  -0.12705
> -0.007  0.018079  0.062337  0.00349
>
> 0.9772678  -0.57499  0.523792  -0.27319  0.163677  0.053579
> -0.07616  0.074556  0.00662  0.087303  0.088835  -0.01923
> -0.04938  0.07299  0.059872  -0.19137  -0.04737  -0.1536
> 0.002926  0.049441  0.079147  0.02771
>
> 0.5172924  0.167666  -0.16523  -0.82951  -0.77577  -0.00981
> 0.018531  -0.09629  -0.1654  0.273644  -0.05433  -0.03593
> 0.115834  0.021465  -0.00981  -0.15112  -0.16178  -0.04783
> -0.19962  -0.12418  0.07286  0.03266
>
> 0.717927  -0.34229  -0.33544  0.817617  -0.21383  0.02735
> 0.01675  -0.10814  -0.1747  0.033743  0.038308  -0.0495
> -0.05961  -0.01977  0.092247  -0.16017  -0.04787  -0.20766
> 0.040038  0.024614  0.090587  -0.0236
>
>
>
>
> Please let me know if any additional information would help explain these
> anomalies.
>
> Thanks & Regards
> Saroj
>
>
>
>
> From:        Felix Cheung <felixcheun...@hotmail.com>
> To:        User <user@spark.apache.org>, Saroj C <saro...@tcs.com>
> Date:        12/31/2016 10:36 AM
> Subject:        Re: Difference in R and Spark Output
> ------------------------------
>
>
>
>
> Could you elaborate more on the huge difference you are seeing?
>
>
> ------------------------------
>
> From: Saroj C <saro...@tcs.com>
> Sent: Friday, December 30, 2016 5:12:04 AM
> To: User
> Subject: Difference in R and Spark Output
>
> Dear All,
> For the attached input file, there is a huge difference between the
> clusters produced by R and by Spark (ML). Any idea what could be causing
> the difference?
>
> Note that we wanted to create five (5) clusters.
>
> Please find the snippets in Spark and R
>
> Spark
>
> // Load the data file
>
> // Create the k-means clusterer
> KMeans kmeans = new KMeans()
>         .setK(5)
>         .setMaxIter(500)
>         .setFeaturesCol("features")
>         .setPredictionCol("prediction");
>
>
> In R
>
> # Load the data file into df
>
> # Create the k-means clusters
>
> model <- kmeans(df, 5)
>
>
>
> Thanks & Regards
> Saroj
>
