By default, HashingTF turns each document into a sparse vector in R^(2^20), i.e. a roughly million-dimensional space. The current Spark clusterer turns each sparse vector into a dense vector with about a million entries when it is added to a cluster. Hence, the memory needed grows as the number of clusters times 8 MB (2^20 doubles at 8 bytes each).
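To make the arithmetic concrete, here is a rough sketch; the helper name hashDocs, the tokenizedDocs input, and the 1000-cluster figure are just illustrative assumptions, not part of the original report:

    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Hash tokenized documents into the default 2^20-dimensional space.
    def hashDocs(tokenizedDocs: RDD[Seq[String]]): RDD[Vector] = {
      val hashingTF = new HashingTF()     // numFeatures defaults to 1 << 20
      hashingTF.transform(tokenizedDocs)  // sparse vectors, ~1M dimensions
    }

    // Memory for densified cluster centers:
    //   2^20 doubles * 8 bytes = 8 MB per center,
    //   so k = 1000 centers already needs roughly 8 GB.
    val bytesPerCenter: Long = (1L << 20) * 8L
    val totalBytes: Long = 1000L * bytesPerCenter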
You should try my new generalized k-means clustering package <https://github.com/derrickburns/generalized-kmeans-clustering>, which works on high-dimensional sparse data. You will want to use the Random Indexing embedding:

    def sparseTrain(raw: RDD[Vector], k: Int): KMeansModel = {
      KMeans.train(raw, k, embeddingNames = List(LOW_DIMENSIONAL_RI))
    }
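Assuming the TF vectors were produced as in the sketch above, calling it would look something like this (the k value is illustrative):

    val tfVectors = hashDocs(tokenizedDocs)
    val model = sparseTrain(tfVectors, k = 1000)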