Only master is really busy at KMeans training

durin Tue, 19 Aug 2014 13:42:30 -0700

When trying to use KMeans.train with some large data and 5 worker nodes, it
would fail due to BlockManagers shutting down because of a timeout. I was able
to prevent that by adding
 
spark.storage.blockManagerSlaveTimeoutMs 3000000


to the spark-defaults.conf.
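
For reference, the same property could presumably also be set in code on the SparkConf rather than in spark-defaults.conf. This is only a sketch of that alternative; the application name is a made-up placeholder, and the timeout value is the one given above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Same timeout as the spark-defaults.conf entry, set programmatically instead.
val conf = new SparkConf()
  .setAppName("KMeansTraining") // hypothetical application name
  .set("spark.storage.blockManagerSlaveTimeoutMs", "3000000")

// The context picks up the setting when it is created.
val sc = new SparkContext(conf)
```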

However, with 1 Million feature vectors, the Stage takeSample at
KMeans.scala:263 runs for about 50 minutes. In this time, about half of the
tasks are done, then I lose the executors and Spark starts a new
repartitioning stage.

I also noticed that in the takeSample stage, a task would run for about
2.5 minutes until it suddenly finished, at which point its displayed duration
(previously those 2.5 min) changed to 2 s, with 0.9 s GC time.

The training data is supplied in this form:
var vectors2 =
vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
var broadcastVector = sc.broadcast(vectors2)

The 1000 partitions are something that could probably be optimized, but too
few will cause OOM errors.
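
For context, a minimal sketch of what a KMeans.train call on such a persisted RDD might look like. The helper name and the values for k and maxIterations are illustrative assumptions, not settings taken from this post; KMeans.train in MLlib accepts an RDD[Vector] directly.

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper: k and maxIterations are illustrative defaults only.
def trainModel(data: RDD[Vector], k: Int = 10, maxIterations: Int = 20): KMeansModel =
  // KMeans.train takes the RDD[Vector] itself as its first argument.
  KMeans.train(data, k, maxIterations)
```

With the data above, this would be called as trainModel(vectors2), assuming vectors is an RDD[Vector].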

Using Ganglia, I can see that the master node is the only one that is
properly busy regarding CPU, and that most (600-700 of 800 total percent
CPU) is used by the master. 
The workers on each node only use 1 Core, i.e. 100% CPU.


What would be the most likely cause for such an inefficient use of the
cluster, and how to prevent it?
Number of partitions, way of caching, ...? 

I'm trying to find out myself with tests, but ideas from someone with more
experience are very welcome.


Best regards,
simn



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Only-master-is-really-busy-at-KMeans-training-tp12411.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
