Dear All,

I am trying to cluster 350k English text phrases (each 4-20 words long) into 50k clusters with KMeans on a standalone system (8 cores, 16 GB RAM). I am using the Kryo serializer with the MEMORY_AND_DISK_SER storage level. With a small number of features in HashingTF I do get clustering results, but the clustering quality is poor. When I increase the number of features, I hit "GC overhead limit exceeded". How can I run the KMeans clustering with the maximum number of features without crashing the application? I don't mind if it takes hours to get the results.
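To make concrete what I mean by "increasing the number of features", this is the kind of change I am making in the featurization step (the value 20000 below is only an illustrative example, not my exact setting):

    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector

    // Example only: a larger feature space than the 500 used in the full
    // program below. 20000 is an illustrative value, not a recommendation.
    val tf = new HashingTF(20000)

    def featurize(s: String): Vector = {
      // Same featurization as in the full program below: character unigrams
      // hashed into the feature space.
      tf.transform(s.sliding(1).toSeq)
    }

    val v = featurize("cluster these short english phrases")
    println(v.size)  // 20000 (the vector itself is stored sparsely)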
Also, is there an agglomerative (hierarchical) clustering algorithm in Spark that can run on a standalone system?

Here is my code for reference:

    import java.io._

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.storage.StorageLevel

    object phrase_app {
      def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Simple Application")
        conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        val sc = new SparkContext(conf)

        // ------ read phrases from text file -----------
        val phrases = sc.textFile("phrases.txt", 10).persist(StorageLevel.MEMORY_AND_DISK_SER)

        // ------ featurize phrases (character unigrams hashed into no_features buckets) ------
        val no_features = 500
        val tf = new HashingTF(no_features)

        def featurize(s: String): Vector = {
          tf.transform(s.sliding(1).toSeq)
        }

        val featureVectors = phrases.map(featurize).persist(StorageLevel.MEMORY_AND_DISK_SER)

        // ------ train KMeans and assign each phrase to a cluster ------
        //val model = KMeans.train(featureVectors, 50000, 10, 1, "random")
        val model = KMeans.train(featureVectors, 50000, 10)
        val clusters = model.predict(featureVectors).collect()

        // ------ print phrases and their cluster ids to a file ------
        val pw = new PrintWriter(new File("cluster_dump.txt"))
        val phrases_array = phrases.collect()
        for (i <- 0 until phrases_array.length) {
          pw.write(phrases_array(i) + ";" + clusters(i) + "\n")
        }
        pw.close()
      }
    }

Thank you for your support.