Re: Selecting first ten values in a RDD/partition
As Brian G alluded to earlier, you can use DStream.mapPartitions() to return the partition-local top 10 for each partition. Once you collect the results from all the partitions, you can do a global top-10 merge across them. This leaves a much smaller dataset to be sent back to the driver when calculating the global top 10.

On Fri, May 30, 2014 at 5:05 AM, nilmish nilmish@gmail.com wrote:
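The two-phase approach can be sketched in plain Python, with no Spark dependency: here `partitions` stands in for the per-partition data that mapPartitions() would hand you, and the hashtag counts are made up for illustration.

```python
import heapq
from operator import itemgetter

# Hypothetical per-partition (hashtag, count) pairs, standing in for
# what each Spark partition would hold after counting.
partitions = [
    [("#spark", 40), ("#scala", 12), ("#bigdata", 33)],
    [("#streaming", 18), ("#kafka", 7), ("#mesos", 3)],
]

K = 2  # use 10 in practice

# Phase 1: partition-local top K (this part would run inside
# mapPartitions() on each executor).
local_tops = [heapq.nlargest(K, part, key=itemgetter(1)) for part in partitions]

# Phase 2: merge the small per-partition results on the driver.
global_top = heapq.nlargest(K, (pair for top in local_tops for pair in top),
                            key=itemgetter(1))
print(global_top)  # [('#spark', 40), ('#bigdata', 33)]
```

Note this is only correct if the counts are already fully aggregated per key (e.g. after reduceByKey, so each hashtag lives in exactly one partition); otherwise a tag whose count is split across partitions could be dropped by the local top-K step.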
Re: Selecting first ten values in a RDD/partition
My primary goal: to get the top 10 hashtags for every 5-minute interval, and to do it efficiently. I have already done this by using reduceByKeyAndWindow() and then sorting all hashtags in the 5-minute interval, taking only the top 10 elements. But this is very slow.

So now I am thinking of retaining only the top 10 hashtags in each RDD, because only these could appear in the final answer. I am stuck at: how to retain only the top 10 hashtags in each RDD of my DStream? Basically I need to transform my DStream so that each RDD contains only the top 10 hashtags, keeping the number of hashtags in a 5-minute interval low.

If there is some more efficient way of doing this then please let me know that also.

Thanks,
Nilesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Selecting-first-ten-values-in-a-RDD-partition-tp6517p6577.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Selecting first ten values in a RDD/partition
Can you clarify what you're trying to achieve here? If you want to take only the top 10 of each RDD, why not sort followed by take(10) on every RDD? Or do you want the top 10 over five minutes?

Cheers,

On Thu, May 29, 2014 at 2:04 PM, nilmish nilmish@gmail.com wrote:

I have a DStream which consists of RDDs partitioned every 2 sec. I have sorted each RDD and want to retain only the top 10 values and discard the rest. How can I retain only the top 10 values?

I am trying to get the top 10 hashtags. Instead of sorting the entire 5-minute window of counts (thereby incurring the cost of a data shuffle), I am trying to get the top 10 hashtags in each partition. I am stuck at how to retain the top 10 hashtags in each partition.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Selecting-first-ten-values-in-a-RDD-partition-tp6517.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
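For what it's worth, the "sort followed by take(10)" idea looks like this in plain Python (the (hashtag, count) pairs are made up for illustration); RDD.top(10, key=...) gives the same result without paying for a full sort.

```python
import heapq

# Illustrative (hashtag, count) pairs for one RDD/batch; values are made up.
counts = [("#spark", 40), ("#scala", 12), ("#bigdata", 33), ("#kafka", 7)]

# "Sort followed by take(10)": a full sort, then keep the head.
top10_sorted = sorted(counts, key=lambda kv: kv[1], reverse=True)[:10]

# The same 10 largest by count, found with a bounded heap instead of
# a full sort (roughly what RDD.top(10, key=...) does under the hood).
top10_heap = heapq.nlargest(10, counts, key=lambda kv: kv[1])

print(top10_sorted == top10_heap)  # True
```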
Re: Selecting first ten values in a RDD/partition
Try looking at the .mapPartitions() method implemented for RDD[T] objects. It will give you direct access to an iterator over the member objects of each partition, for doing the kind of within-partition hashtag counts you're describing. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Selecting-first-ten-values-in-a-RDD-partition-tp6517p6534.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
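A sketch of what such a function might look like, written as a plain-Python stand-in for the function you would pass to mapPartitions() (it receives one partition's iterator, counts hashtags locally, and yields only that partition's top-k pairs; the function name and data are illustrative):

```python
import heapq
from collections import Counter
from operator import itemgetter

def top_hashtags_in_partition(hashtags_iter, k=10):
    """Stand-in for a mapPartitions() function: consume one partition's
    iterator of hashtags, count them locally, and yield only that
    partition's top-k (hashtag, count) pairs."""
    counts = Counter(hashtags_iter)
    yield from heapq.nlargest(k, counts.items(), key=itemgetter(1))

# Simulate one partition's contents (made-up data).
partition = iter(["#spark", "#spark", "#scala", "#spark", "#scala", "#kafka"])
print(list(top_hashtags_in_partition(partition, k=2)))
# [('#spark', 3), ('#scala', 2)]
```

Because the function both consumes and returns an iterator, it fits mapPartitions() directly, and each partition ships at most k pairs onward instead of its full hashtag set.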