Re: KMeans code is rubbish

Xiangrui Meng Thu, 10 Jul 2014 09:59:24 -0700

SparkKMeans is a naive implementation. Please use
mllib.clustering.KMeans in practice. I created a JIRA for this:
https://issues.apache.org/jira/browse/SPARK-2434

-Xiangrui


On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
<tathagata.das1...@gmail.com> wrote:
> I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your
> dataset as well, and I got the expected answer. I believe that even though
> initialization is done using sampling, the example sets the seed to a
> constant 42, so the result should be the same no matter how many times you
> run it. So I am not really sure what's going on here.
>
> Can you tell us more about which version of Spark you are running? Which
> Java version?
>
>
> ======================================
>
> [tdas @ Xion spark2] cat input
> 2 1
> 1 2
> 3 2
> 2 3
> 4 1
> 5 1
> 6 1
> 4 2
> 6 2
> 4 3
> 5 3
> 6 3
> [tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
> 2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from
> SCDynamicStore
> 14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 14/07/10 02:45:07 WARN LoadSnappy: Snappy native library not loaded
> 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeSystemBLAS
> 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeRefBLAS
> Finished iteration (delta = 3.0)
> Finished iteration (delta = 0.0)
> Final centers:
> DenseVector(5.0, 2.0)
> DenseVector(2.0, 2.0)
>
>
>
> On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk <wanda_haw...@yahoo.com> wrote:
>>
>> So this is what I am running:
>> "./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001"
>>
>> And this is the input file:"
>> ┌───[spark2013@SparkOne]──────[~/spark-1.0.0].$
>> └───#!cat ~/Documents/2dim2.txt
>> 2 1
>> 1 2
>> 3 2
>> 2 3
>> 4 1
>> 5 1
>> 6 1
>> 4 2
>> 6 2
>> 4 3
>> 5 3
>> 6 3
>> "
>>
>> This is the final output from spark:
>> "14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> Getting 2 non-empty blocks out of 2 blocks
>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> Started 0 remote fetches in 0 ms
>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> maxBytesInFlight: 50331648, targetRequestSize: 10066329
>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> Getting 2 non-empty blocks out of 2 blocks
>> 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> Started 0 remote fetches in 0 ms
>> 14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
>> 14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
>> 14/07/10 20:05:12 INFO Executor: Finished task ID 14
>> 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
>> 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on
>> localhost (progress: 1/2)
>> 14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
>> 14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
>> 14/07/10 20:05:12 INFO Executor: Finished task ID 15
>> 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
>> 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on
>> localhost (progress: 2/2)
>> 14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at
>> SparkKMeans.scala:75) finished in 0.008 s
>> 14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks
>> have all completed, from pool
>> 14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at
>> SparkKMeans.scala:75, took 0.02472681 s
>> Finished iteration (delta = 0.0)
>> Final centers:
>> DenseVector(2.8571428571428568, 2.0)
>> DenseVector(5.6000000000000005, 2.0)
>> "
>>
>>
>>
>>
>> On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux <decho...@gmail.com>
>> wrote:
>>
>>
>> A picture is worth a thousand... Well, a picture of this dataset, showing
>> what you expect and what you get, would help answer your initial
>> question.
>>
>> Bertrand
>>
>>
>> On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk <wanda_haw...@yahoo.com>
>> wrote:
>>
>> Can someone please run the standard kMeans code on this input with 2
>> centers?
>> 2 1
>> 1 2
>> 3 2
>> 2 3
>> 4 1
>> 5 1
>> 6 1
>> 4 2
>> 6 2
>> 4 3
>> 5 3
>> 6 3
>>
>> The obvious result should be (2,2) and (5,2) ... (you can draw them if you
>> don't believe me ...)
>>
>> Thanks,
>> Wanda
>>
>>
>>
>>
>
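
Both sets of final centers reported in this thread can be reproduced with a plain Lloyd-style k-means loop, which suggests the difference comes from the starting centers rather than from a bug in the update step. The sketch below is plain Python, not the Spark example's code. The two pairs of initial centers are hypothetical values chosen for illustration; they are not the centers that either run actually sampled. The convergence test (stop once the summed squared movement of the centers is at most 0.001) is an assumption modeled on the command-line argument used above.

```python
# Dataset copied from the thread (12 two-dimensional points).
POINTS = [(2, 1), (1, 2), (3, 2), (2, 3), (4, 1), (5, 1),
          (6, 1), (4, 2), (6, 2), (4, 3), (5, 3), (6, 3)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def closest(p, centers):
    """Index of the center nearest to p."""
    return min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))


def lloyd(points, centers, converge_dist=0.001):
    """Plain Lloyd iterations starting from the given centers."""
    centers = [tuple(float(v) for v in c) for c in centers]
    while True:
        # Assignment step: group points by their nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else old
                       for c, old in zip(clusters, centers)]
        delta = sum(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if delta <= converge_dist:
            return centers


def cost(points, centers):
    """Within-cluster sum of squared distances."""
    return sum(sq_dist(p, centers[closest(p, centers)]) for p in points)


# Hypothetical starting centers, chosen only to illustrate the effect.
good = lloyd(POINTS, [(2, 1), (5, 1)])
worse = lloyd(POINTS, [(3, 2), (6, 2)])

print(good, cost(POINTS, good))
print(worse, cost(POINTS, worse))
```

With these starting points the first run converges to (2, 2) and (5, 2), and the second to roughly (2.857, 2) and (5.6, 2), the two answers seen above. Both are fixed points of the iteration, but the first has the lower within-cluster sum of squares (16 versus roughly 18.06), so the second is a poorer local optimum rather than an arithmetic error.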
