Why doesn't something like this work? If you want a continuously updated reference to the top counts, you can use a global variable.
var topCounts: Array[(String, Int)] = null

sortedCounts.foreachRDD { rdd =>
  val currentTopCounts = rdd.take(10)
  // print currentTopCounts or whatever
  topCounts = currentTopCounts
}

TD

On Mon, Jul 14, 2014 at 4:11 PM, jon.burns <jon.bu...@uleth.ca> wrote:
> Hello everyone,
>
> I'm an undergrad working on a summarization project. I've created a
> summarizer in normal Spark and it works great; however, I want to write it
> for Spark Streaming to increase its functionality. Basically, I take in a
> bunch of text and get the most popular words as well as the most popular
> bi-grams (two words together), and I've managed to do this with streaming
> (and made it stateful, which is great). However, the next part of my
> algorithm requires me to get the top 10 words and top 10 bigrams and store
> them in a vector-like structure. With plain Spark I would use code like:
>
> array_of_words = words.sortByKey().top(50)
>
> Is there a way to mimic this with streaming? I was following along with
> the AMP Camp tutorial
> <http://ampcamp.berkeley.edu/big-data-mini-course/realtime-processing-with-spark-streaming.html>
> so I know that you can print the top 10 by using:
>
> sortedCounts.foreach(rdd =>
>   println("\nTop 10 hashtags:\n" + rdd.take(10).mkString("\n")))
>
> However, I can't seem to alter this to make it store the top 10, just
> print them. The instructor mentions at the end that "one can get the top
> 10 hashtags in each partition, collect them together at the driver and
> then find the top 10 hashtags among them", but they leave it as an
> exercise. I would appreciate any help :)
>
> Thanks
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-collect-take-functionality-tp9670.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
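For anyone finding this thread later, the per-partition exercise quoted above can be sketched in plain Scala. This is only a minimal sketch, not the instructor's actual solution: `globalTop10` and `partitionTops` are hypothetical names, and it assumes the counts are already fully aggregated per key (e.g. via `reduceByKey`), so the partition-local top-10 lists can simply be merged at the driver.

```scala
// Hypothetical driver-side merge: each element of partitionTops is the
// local top-10 (word, count) list computed inside one partition. In Spark
// that input could come from something like:
//   rdd.mapPartitions(it => Iterator(it.toSeq.sortBy(-_._2).take(10))).collect()
def globalTop10(partitionTops: Seq[Seq[(String, Int)]]): Seq[(String, Int)] =
  partitionTops.flatten.sortBy(-_._2).take(10)

// Example: two partitions' local tops merged into one global ranking.
val merged = globalTop10(Seq(
  Seq(("spark", 7), ("stream", 3)),
  Seq(("scala", 5))
))
println(merged)  // List((spark,7), (scala,5), (stream,3))
```

Merging local tops is only correct when each word's total count lives in a single partition; with per-key aggregation in place, `RDD.top(10)` (with an ordering on the count) does essentially this same merge for you.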