Hi all, I'm very new to machine learning algorithms and Spark. I'm following the Twitter Streaming Language Classifier found here:
http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html

Specifically this code:
http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala

Except I'm trying to run it in batch mode on some tweets it pulls out of Cassandra, in this case 200 total tweets.

As the example shows, I am using this object for "vectorizing" a set of tweets:

    object Utils {
      val numFeatures = 1000
      val tf = new HashingTF(numFeatures)

      /**
       * Create feature vectors by turning each tweet into bigrams of
       * characters (an n-gram model) and then hashing those to a
       * length-1000 feature vector that we can pass to MLlib.
       * This is a common way to decrease the number of features in a
       * model while still getting excellent accuracy (otherwise every
       * pair of Unicode characters would potentially be a feature).
       */
      def featurize(s: String): Vector = {
        tf.transform(s.sliding(2).toSeq)
      }
    }

Here is my code, which is modified from ExamineAndTrain.scala:

    val noSets = rawTweets.map(set => set.mkString("\n"))
    val vectors = noSets.map(Utils.featurize).cache()
    vectors.count()

    val numClusters = 5
    val numIterations = 30
    val model = KMeans.train(vectors, numClusters, numIterations)

    for (i <- 0 until numClusters) {
      println(s"\nCLUSTER $i")
      noSets.foreach { t =>
        if (model.predict(Utils.featurize(t)) == 1) {
          println(t)
        }
      }
    }

This code runs and prints the "CLUSTER 0", "CLUSTER 1", etc. headings with nothing printed beneath them. If I flip model.predict(Utils.featurize(t)) == 1 to model.predict(Utils.featurize(t)) == 0, the same thing happens except every tweet is printed beneath every cluster.

Here is what I intuitively think is happening (please correct my thinking if it's wrong): this code turns each tweet into a vector, randomly picks some clusters, then runs kmeans to group the tweets (at a really high level, the clusters, I assume, would be common "topics").
As such, when it checks each tweet to see if model.predict == 1, different sets of tweets should appear under each cluster (and because it's checking the training set against itself, every tweet should be in a cluster). Why isn't it doing this? Either my understanding of what kmeans does is wrong, my training set is too small, or I'm missing a step.

Any help is greatly appreciated.
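
To make my expectation concrete, here is a minimal plain-Scala sketch (no Spark) of what I think should happen: each text is hashed into a fixed-length bigram count vector, assigned to the index of its nearest centre, and the texts are then grouped by that index. Everything in it is an assumption made for illustration: the names ClusterSketch, sqDist and groupByCluster are hypothetical, the hashing is only a stand-in for HashingTF (MLlib may pick different buckets), and the two centres are hand-picked rather than learned by KMeans.train.

```scala
object ClusterSketch {
  // Same vector length as Utils.numFeatures in the post.
  val numFeatures = 1000

  /** Hash each character bigram into one of numFeatures buckets and count hits.
    * Plain-Scala stand-in for HashingTF; not MLlib's exact hashing. */
  def featurize(s: String): Array[Double] = {
    val v = new Array[Double](numFeatures)
    s.sliding(2).foreach { gram =>
      // Non-negative modulus so negative hash codes still map to a valid index.
      val idx = ((gram.hashCode % numFeatures) + numFeatures) % numFeatures
      v(idx) += 1.0
    }
    v
  }

  /** Squared Euclidean distance between two equal-length vectors. */
  def sqDist(a: Array[Double], b: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  /** Index of the nearest centre: the role I assume model.predict plays. */
  def predict(centres: Seq[Array[Double]], v: Array[Double]): Int =
    centres.indices.minBy(i => sqDist(centres(i), v))

  /** Group texts by the cluster index predicted for each one. */
  def groupByCluster(centres: Seq[Array[Double]],
                     texts: Seq[String]): Map[Int, Seq[String]] =
    texts.groupBy(t => predict(centres, featurize(t)))

  def main(args: Array[String]): Unit = {
    // Toy data and hand-picked centres, purely for illustration.
    val tweets = Seq("aaaa aaaa", "aaa aaaa", "zzzz zzzz", "zzz zzzz")
    val centres = Seq(featurize("aaaa aaaa"), featurize("zzzz zzzz"))
    for ((cluster, members) <- groupByCluster(centres, tweets).toSeq.sortBy(_._1)) {
      println(s"\nCLUSTER $cluster")
      members.foreach(println)
    }
  }
}
```

In this sketch every text lands under exactly one cluster heading, because the predicted index itself decides the group rather than a comparison against a fixed number. That is the behaviour I was expecting from my loop above.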