When trying to use KMeans.train with a large data set and 5 worker nodes, training would fail due to BlockManagers shutting down because of a timeout. I was able to prevent that by adding spark.storage.blockManagerSlaveTimeoutMs 3000000
to spark-defaults.conf.

However, with 1 million feature vectors, the stage "takeSample at KMeans.scala:263" runs for about 50 minutes. In that time about half of the tasks complete, then I lose the executors and Spark starts a new repartitioning stage. I also noticed that in the takeSample stage a task would show as running for about 2.5 minutes and then suddenly finish, at which point its reported duration changed from those 2.5 minutes to 2 s, with 0.9 s of GC time.

The training data is supplied in this form:

    var vectors2 = vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
    var broadcastVector = sc.broadcast(vectors2)

The 1000 partitions could probably be optimized, but too few will cause OOM errors.

Using Ganglia, I can see that the master node is the only one that is properly busy in terms of CPU, and that most of the load there (600-700% of the 800% total CPU) comes from the master. The workers on each node use only 1 core, i.e. 100% CPU.

What would be the most likely cause of such inefficient use of the cluster, and how can I prevent it? The number of partitions, the way of caching, ...?

I'm trying to find out myself with tests, but ideas from someone with more experience are very welcome.

Best regards,
simn

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Only-master-is-really-busy-at-KMeans-training-tp12411.html