I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your dataset as well, and I got the expected answer. Even though initialization is done by sampling, the example sets the seed to a constant (42), so the result should be the same no matter how many times you run it. So I am not really sure what's going on here.
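For what it's worth, the "obvious" centers can be sanity-checked with a plain, non-Spark Lloyd's iteration over the same 12 points. This is just a sketch, not the SparkKMeans code: it initializes from the first k points for determinism, whereas SparkKMeans actually samples k points with the constant seed 42.

```scala
// Minimal Lloyd's k-means on the dataset from the thread (assumption:
// deterministic init from the first k points; SparkKMeans samples with seed 42).
object KMeansCheck {
  type Point = (Double, Double)

  def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1; val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  def mean(ps: Seq[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  def kmeans(points: Seq[Point], k: Int, eps: Double): Seq[Point] = {
    var centers = points.take(k)        // deterministic init (assumption)
    var delta = Double.MaxValue
    while (delta > eps) {
      // Assign each point to its closest center, then recompute the means.
      val clusters = points.groupBy(p => centers.minBy(c => dist2(p, c)))
      val next = centers.map(c => mean(clusters.getOrElse(c, Seq(c))))
      delta = centers.zip(next).map { case (a, b) => dist2(a, b) }.sum
      centers = next
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val data: Seq[Point] = Seq(
      (2, 1), (1, 2), (3, 2), (2, 3), (4, 1), (5, 1),
      (6, 1), (4, 2), (6, 2), (4, 3), (5, 3), (6, 3)
    ).map { case (x, y) => (x.toDouble, y.toDouble) }
    println(kmeans(data, 2, 0.001).sortBy(_._1))
    // converges to (2.0, 2.0) and (5.0, 2.0)
  }
}
```

With this data the iteration settles on exactly (2, 2) and (5, 2), matching the run below, so whatever you are seeing is not an ambiguity in the dataset itself.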
Can you tell us more about which version of Spark you are running? Which Java version?

======================================

[tdas @ Xion spark2] cat input
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3
[tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from SCDynamicStore
14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/10 02:45:07 WARN LoadSnappy: Snappy native library not loaded
14/07/10 02:45:08 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
14/07/10 02:45:08 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
Finished iteration (delta = 3.0)
Finished iteration (delta = 0.0)
Final centers:
DenseVector(5.0, 2.0)
DenseVector(2.0, 2.0)

On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk <wanda_haw...@yahoo.com> wrote:
> so this is what I am running:
> "./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001"
>
> And this is the input file:
> "┌───[spark2013@SparkOne]──────[~/spark-1.0.0].$
> └───#!cat ~/Documents/2dim2.txt
> 2 1
> 1 2
> 3 2
> 2 3
> 4 1
> 5 1
> 6 1
> 4 2
> 6 2
> 4 3
> 5 3
> 6 3
> "
>
> This is the final output from spark:
> "14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
> 14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
> 14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
> 14/07/10 20:05:12 INFO Executor: Finished task ID 14
> 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
> 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on localhost (progress: 1/2)
> 14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
> 14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
> 14/07/10 20:05:12 INFO Executor: Finished task ID 15
> 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
> 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on localhost (progress: 2/2)
> 14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at SparkKMeans.scala:75) finished in 0.008 s
> 14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
> 14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at SparkKMeans.scala:75, took 0.02472681 s
> Finished iteration (delta = 0.0)
> Final centers:
> DenseVector(2.8571428571428568, 2.0)
> DenseVector(5.6000000000000005, 2.0)
> "
>
> On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
>
> A picture is worth a thousand... Well, a picture with this dataset, what you are expecting and what you get, would help answering your initial question.
>
> Bertrand
>
> On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk <wanda_haw...@yahoo.com> wrote:
>
> Can someone please run the standard kMeans code on this input with 2 centers?:
> 2 1
> 1 2
> 3 2
> 2 3
> 4 1
> 5 1
> 6 1
> 4 2
> 6 2
> 4 3
> 5 3
> 6 3
>
> The obvious result should be (2,2) and (5,2) ... (you can draw them if you don't believe me ...)
>
> Thanks,
> Wanda