I used "k-means||", which is the default, and initialization took less
than 1 minute. 50 iterations took less than 25 minutes on a cluster of
9 m3.2xlarge EC2 nodes. Which deploy mode did you use? Is it
yarn-client? -Xiangrui
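
For readers comparing the initialization modes mentioned in this thread, here is a minimal single-machine sketch of uniform random seeding versus k-means++ style seeding. It is not Spark's code: "k-means||" is a parallel variant that samples candidates over a few rounds and then runs a local k-means++ step on them. The function names and the toy data are hypothetical, chosen only for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def random_init(points, k, rng):
    """Random seeding: draw k distinct points uniformly at random."""
    return rng.sample(points, k)


def kmeans_pp_init(points, k, rng):
    """k-means++ style seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center so far.
    d2 = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            chosen = rng.choice(points)
        else:
            threshold = rng.random() * total
            cumulative = 0.0
            chosen = points[-1]
            for p, weight in zip(points, d2):
                cumulative += weight
                if cumulative > threshold:
                    chosen = p
                    break
        centers.append(chosen)
        # One full pass over the data for every center that is added.
        d2 = [min(old, squared_distance(p, chosen)) for p, old in zip(points, d2)]
    return centers


# Toy data (an assumption for this sketch): three tight 2-D blobs.
rng = random.Random(0)
points = [
    (rng.gauss(cx, 0.1), rng.gauss(cy, 0.1))
    for cx, cy in [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
    for _ in range(50)
]
centers = kmeans_pp_init(points, 3, rng)
uniform_centers = random_init(points, 3, rng)
```

Sequential k-means++ makes one pass over the data per center, while random seeding needs only a single sample, which is the cost difference this sketch is meant to show. In MLlib the mode is selected with `setInitializationMode`, using "random" or "k-means||".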

On Tue, Oct 14, 2014 at 6:03 PM, Ray <ray-w...@outlook.com> wrote:
> Hi Xiangrui,
>
> Thanks for the guidance. I read the log carefully and found the root cause.
>
> KMeans, by default, uses KMeans++ as the initialization mode. According to
> the log file, the 70-minute hang is actually the time spent computing the
> KMeans++ initialization, as pasted below:
>
> 14/10/14 14:48:18 INFO DAGScheduler: Stage 20 (collectAsMap at
> KMeans.scala:293) finished in 2.233 s
> 14/10/14 14:48:18 INFO SparkContext: Job finished: collectAsMap at
> KMeans.scala:293, took 85.590020124 s
> 14/10/14 14:48:18 INFO ShuffleBlockManager: Could not find files for shuffle
> 5 for deleting
> 14/10/14 *14:48:18* INFO ContextCleaner: Cleaned shuffle 5
> 14/10/14 15:50:41 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeSystemBLAS
> 14/10/14 15:50:41 WARN BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeRefBLAS
> *14/10/14 15:54:36 INFO LocalKMeans: Local KMeans++ converged in 11
> iterations.
> 14/10/14 15:54:36 INFO KMeans: Initialization with k-means|| took 4426.913
> seconds.*
> 14/10/14 15:54:37 INFO SparkContext: Starting job: collectAsMap at
> KMeans.scala:190
> 14/10/14 15:54:37 INFO DAGScheduler: Registering RDD 38 (reduceByKey at
> KMeans.scala:190)
> 14/10/14 15:54:37 INFO DAGScheduler: Got job 16 (collectAsMap at
> KMeans.scala:190) with 100 output partitions (allowLocal=false)
> 14/10/14 15:54:37 INFO DAGScheduler: Final stage: Stage 22(collectAsMap at
> KMeans.scala:190)
>
> I now use "random" as the KMeans initialization mode, with all other
> settings unchanged. This time it finished quickly.
>
> In your test on mnist8m, did you use KMeans++ as the initialization mode?
> How long did it take?
>
> Thanks again for your help.
>
> Ray
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16450.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
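
The pause in the quoted log can be located mechanically rather than by eye. The following is a minimal sketch, assuming the `yy/MM/dd HH:mm:ss` timestamp prefix seen in the excerpt; the lines are copied from the log above with the emphasis asterisks removed, and the helper names are hypothetical.

```python
from datetime import datetime

# Timestamped lines copied from the quoted log (continuation lines omitted).
LOG_LINES = [
    "14/10/14 14:48:18 INFO ContextCleaner: Cleaned shuffle 5",
    "14/10/14 15:50:41 WARN BLAS: Failed to load implementation from:",
    "14/10/14 15:54:36 INFO KMeans: Initialization with k-means|| took 4426.913",
    "14/10/14 15:54:37 INFO SparkContext: Starting job: collectAsMap at",
]


def parse_timestamp(line):
    """Parse the leading 17-character 'yy/MM/dd HH:mm:ss' timestamp."""
    return datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S")


def largest_gap(lines):
    """Return (gap_seconds, earlier_line, later_line) for the longest pause
    between consecutive log lines."""
    stamped = [(parse_timestamp(line), line) for line in lines]
    gaps = [
        ((later[0] - earlier[0]).total_seconds(), earlier[1], later[1])
        for earlier, later in zip(stamped, stamped[1:])
    ]
    return max(gaps, key=lambda gap: gap[0])


gap_seconds, before, after = largest_gap(LOG_LINES)
print(f"largest gap: {gap_seconds / 60:.1f} min")                 # 62.4 min
print(f"reported initialization: {4426.913 / 60:.1f} min")        # 73.8 min
```

On this excerpt the longest silent stretch falls between the `ContextCleaner` line and the first `BLAS` warning, inside the interval that the `k-means||` initialization message accounts for.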

