Dear All,

I am trying to cluster 350k English text phrases (each 4-20 words) into
50k clusters with KMeans on a standalone system (8 cores, 16 GB). I am using
the Kryo serializer with MEMORY_AND_DISK_SER persistence. Although I get clustering
results with a lower number of features in HashingTF, the clustering quality
is poor. When I increase the number of features, I hit "GC overhead
limit exceeded". How can I run the KMeans clustering with the maximum number
of features without crashing the app? I don't mind if it takes hours to get
the results.
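
If my back-of-envelope math is right (assuming MLlib keeps the k cluster
centers as dense vectors of doubles), the centers alone grow linearly with the
feature count, which may be what is blowing up the heap:

// rough estimate: k dense centers of numFeatures doubles each (8 bytes per double)
val k = 50000
val centersAt500   = k.toLong * 500   * 8   // ~200 MB of centers at 500 features
val centersAt10000 = k.toLong * 10000 * 8   // ~4 GB of centers at 10,000 features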

Also, is there an agglomerative (hierarchical) clustering algorithm in
Spark that can run on a standalone system?
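
I have seen mention of a bisecting k-means (a divisive, top-down hierarchical
clustering) for spark.mllib. Assuming org.apache.spark.mllib.clustering.BisectingKMeans
is available in my Spark version, I imagine the usage would look roughly like
this (featureVectors is the RDD[Vector] built in the code below; this is only
a sketch):

import org.apache.spark.mllib.clustering.BisectingKMeans

// bisecting k-means is divisive (top-down), not agglomerative (bottom-up)
val bkm = new BisectingKMeans()
  .setK(50000)
  .setMaxIterations(10)
val bkmModel = bkm.run(featureVectors)              // featureVectors: RDD[Vector]
val bkmClusters = bkmModel.predict(featureVectors)  // RDD[Int] of cluster ids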

Here is my code for reference - 

import java.io.{File, PrintWriter}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.storage.StorageLevel

object phrase_app {
  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("Simple Application")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // ------ read phrases from text file -----------
    val phrases = sc.textFile("phrases.txt", 10).persist(StorageLevel.MEMORY_AND_DISK_SER)

    // ---- featurize phrases --------
    val no_features = 500
    val tf = new HashingTF(no_features)
    // hash each phrase's character unigrams into a fixed-size term-frequency vector
    def featurize(s: String): Vector = {
      tf.transform(s.sliding(1).toSeq)
    }
    val featureVectors = phrases.map(featurize).persist(StorageLevel.MEMORY_AND_DISK_SER)

    // ------ train KMeans and get cluster assignments --------
    //val model = KMeans.train(featureVectors, 50000, 10, 1, "random")
    val model = KMeans.train(featureVectors, 50000, 10)
    val clusters = model.predict(featureVectors).collect()

    // ---- print phrases and clusters to file (collects everything to the driver) --------
    val pw = new PrintWriter(new File("cluster_dump.txt"))
    val phrases_array = phrases.collect()
    for (i <- 0 until phrases_array.length) {
      pw.write(phrases_array(i) + ";" + clusters(i) + "\n")
    }
    pw.close()
  }
}
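
One thing I noticed while pasting this: featurize hashes character unigrams
(s.sliding(1)), not words, which may be part of why the quality is poor at low
feature counts. A word-level variant I could try instead would be something
like this (naive whitespace tokenization, just a sketch):

// hash whitespace-separated, lower-cased words instead of single characters
def featurizeWords(s: String): Vector = {
  tf.transform(s.toLowerCase.trim.split("\\s+").toSeq)
}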


Thank you for your support. 


