SparkKMeans is a naive implementation. Please use mllib.clustering.KMeans in practice. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-2434

-Xiangrui
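
A note for anyone puzzling over why two runs on the same twelve points can end at different centers: Lloyd's iteration (assign each point to its nearest center, then recompute the means) stops at any fixed point, not only the best one. The sketch below is plain Python, not the SparkKMeans source, and `lloyd_step`, `cost`, and `is_fixed_point` are hypothetical helper names. It checks the centers Wanda expected, (2,2) and (5,2), against the ones her run reported, roughly (2.857,2) and (5.6,2). Whether a different initial sample is what actually happened in her run is an assumption; nothing in the thread confirms it.

```python
# Plain-Python sketch of one Lloyd iteration on the dataset from this thread.
# This is an illustration only, not the SparkKMeans or MLlib implementation.
points = [(2, 1), (1, 2), (3, 2), (2, 3), (4, 1), (5, 1),
          (6, 1), (4, 2), (6, 2), (4, 3), (5, 3), (6, 3)]


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def lloyd_step(centers):
    """Assign each point to its nearest center, then return the cluster means."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)
    # A center that attracts no points is left where it is.
    return [
        (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
        for c, old in zip(clusters, centers)
    ]


def cost(centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def is_fixed_point(centers, tol=1e-9):
    """True if one more Lloyd iteration leaves the centers unchanged."""
    return all(sq_dist(a, b) < tol for a, b in zip(centers, lloyd_step(centers)))


expected = [(2.0, 2.0), (5.0, 2.0)]      # the centers Wanda expected
reported = [(20 / 7, 2.0), (5.6, 2.0)]   # the centers her run printed, as fractions

for name, centers in [("expected", expected), ("reported", reported)]:
    print(name, is_fixed_point(centers), round(cost(centers), 3))
```

Both pairs come out as fixed points. The expected pair has a total squared-distance cost of 16 and the reported pair about 18.06, so the reported centers are a worse but stable local optimum: the three points with x = 4 stay attached to the left cluster, and no further iteration moves them.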

On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
> I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your
> dataset as well, and I got the expected answer. And I believe that even though
> initialization is done using sampling, the example actually sets the seed to
> a constant 42, so the result should always be the same no matter how many
> times you run it. So I am not really sure what's going on here.
>
> Can you tell us more about which version of Spark you are running? Which
> Java version?
>
> ======================================
>
> [tdas @ Xion spark2] cat input
> 2 1
> 1 2
> 3 2
> 2 3
> 4 1
> 5 1
> 6 1
> 4 2
> 6 2
> 4 3
> 5 3
> 6 3
> [tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
> 2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from
> SCDynamicStore
> 14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 14/07/10 02:45:07 WARN LoadSnappy: Snappy native library not loaded
> 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeSystemBLAS
> 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeRefBLAS
> Finished iteration (delta = 3.0)
> Finished iteration (delta = 0.0)
> Final centers:
> DenseVector(5.0, 2.0)
> DenseVector(2.0, 2.0)
>
> On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk <wanda_haw...@yahoo.com> wrote:
>>
>> so this is what I am running:
>> "./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001"
>>
>> And this is the input file:"
>> ┌───[spark2013@SparkOne]──────[~/spark-1.0.0].$
>> └───#!cat ~/Documents/2dim2.txt
>> 2 1
>> 1 2
>> 3 2
>> 2 3
>> 4 1
>> 5 1
>> 6 1
>> 4 2
>> 6 2
>> 4 3
>> 5 3
>> 6 3
>> "
>>
>> This is the final output from spark:
>> "14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> Getting 2 non-empty blocks out of 2 blocks
>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> Started 0 remote fetches in 0 ms
>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> maxBytesInFlight: 50331648, targetRequestSize: 10066329
>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> Getting 2 non-empty blocks out of 2 blocks
>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> Started 0 remote fetches in 0 ms
>> 14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
>> 14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
>> 14/07/10 20:05:12 INFO Executor: Finished task ID 14
>> 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
>> 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on
>> localhost (progress: 1/2)
>> 14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
>> 14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
>> 14/07/10 20:05:12 INFO Executor: Finished task ID 15
>> 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
>> 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on
>> localhost (progress: 2/2)
>> 14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at
>> SparkKMeans.scala:75) finished in 0.008 s
>> 14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks
>> have all completed, from pool
>> 14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at
>> SparkKMeans.scala:75, took 0.02472681 s
>> Finished iteration (delta = 0.0)
>> Final centers:
>> DenseVector(2.8571428571428568, 2.0)
>> DenseVector(5.6000000000000005, 2.0)
>> "
>>
>> On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux <decho...@gmail.com>
>> wrote:
>>
>> A picture is worth a thousand... Well, a picture with this dataset, what
>> you are expecting and what you get, would help answering your initial
>> question.
>>
>> Bertrand
>>
>> On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk <wanda_haw...@yahoo.com>
>> wrote:
>>
>> Can someone please run the standard kMeans code on this input with 2
>> centers?
>> 2 1
>> 1 2
>> 3 2
>> 2 3
>> 4 1
>> 5 1
>> 6 1
>> 4 2
>> 6 2
>> 4 3
>> 5 3
>> 6 3
>>
>> The obvious result should be (2,2) and (5,2) ... (you can draw them if you
>> don't believe me ...)
>>
>> Thanks,
>> Wanda