Re: Selecting first ten values in a RDD/partition

2014-06-29 Thread Chris Fregly
As Brian G alluded to earlier, you can use DStream.mapPartitions() to
return the partition-local top 10 for each partition. Once you collect
the results from all the partitions, you can do a global top-10 merge
across them.

This leads to a much smaller dataset being brought back to the driver
to calculate the global top 10.
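
A rough sketch of that pattern (assuming hashtagCounts is a
DStream[(String, Long)] of (hashtag, count) pairs; the variable names are
illustrative, not from this thread):

    // Sketch only: each partition emits its local top 10, so only a handful
    // of records per partition ever leave the executors.
    val localTop10 = hashtagCounts.mapPartitions { iter =>
      iter.toList.sortBy(-_._2).take(10).iterator
    }

    // Merge the per-partition candidates into the global top 10 on the driver.
    localTop10.foreachRDD { rdd =>
      val globalTop10 = rdd.collect().sortBy(-_._2).take(10)
      println(globalTop10.mkString(", "))
    }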


On Fri, May 30, 2014 at 5:05 AM, nilmish nilmish@gmail.com wrote:

 My primary goal: to get the top 10 hashtags for every 5-minute interval.

 I want to do this efficiently. I have already done this by using
 reduceByKeyAndWindow() and then sorting all hashtags in the 5-minute
 interval, taking only the top 10 elements. But this is very slow.

 So now I am thinking of retaining only the top 10 hashtags in each RDD,
 because only these could end up in the final answer.

 I am stuck at: how do I retain only the top 10 hashtags in each RDD of my
 DStream? Basically, I need to transform my DStream so that each RDD contains
 only its top 10 hashtags, keeping the number of hashtags in the 5-minute
 interval low.

 If there is a more efficient way of doing this, please let me know that as
 well.

 Thanx,
 Nilesh






Re: Selecting first ten values in a RDD/partition

2014-05-30 Thread nilmish
My primary goal: to get the top 10 hashtags for every 5-minute interval.

I want to do this efficiently. I have already done this by using
reduceByKeyAndWindow() and then sorting all hashtags in the 5-minute
interval, taking only the top 10 elements. But this is very slow.

So now I am thinking of retaining only the top 10 hashtags in each RDD,
because only these could end up in the final answer.

I am stuck at: how do I retain only the top 10 hashtags in each RDD of my
DStream? Basically, I need to transform my DStream so that each RDD contains
only its top 10 hashtags, keeping the number of hashtags in the 5-minute
interval low.
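
What I have in mind is something like the following (a rough sketch only;
counts stands for the DStream[(String, Long)] produced by
reduceByKeyAndWindow, and the names are illustrative):

    // Sketch only: counts is assumed to be a DStream[(String, Long)] of
    // (hashtag, count) pairs coming out of reduceByKeyAndWindow.
    // transform() replaces each RDD in the DStream with an RDD holding
    // just its top 10 hashtags by count.
    val top10PerRDD = counts.transform { rdd =>
      val top10 = rdd.top(10)(Ordering.by(_._2))  // bounded selection, no full sort
      rdd.sparkContext.parallelize(top10)         // wrap it back into an RDD
    }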

If there is a more efficient way of doing this, please let me know that as
well.

Thanx,
Nilesh





Selecting first ten values in a RDD/partition

2014-05-29 Thread nilmish
I have a DStream whose RDDs are batched every 2 seconds. I have sorted each
RDD and want to retain only the top 10 values and discard the rest. How can I
retain only the top 10 values?

I am trying to get the top 10 hashtags. Instead of sorting the entire
5-minute counts (thereby incurring the cost of a data shuffle), I am trying
to get the top 10 hashtags in each partition. I am stuck at how to retain the
top 10 hashtags in each partition.





Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Anwar Rizal
Can you clarify what you're trying to achieve here?

If you want to take only the top 10 of each RDD, why not sort followed by
take(10) on every RDD?
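
For example, something along these lines (a sketch only; counts is a
hypothetical DStream[(String, Long)] of (hashtag, count) pairs, not a name
from this thread):

    counts.foreachRDD { rdd =>
      // Either sort descending by count and take the first 10 ...
      val viaSort = rdd.sortBy(_._2, ascending = false).take(10)
      // ... or skip the full sort and let top() do a bounded selection.
      val viaTop = rdd.top(10)(Ordering.by(_._2))
      println(viaTop.mkString(", "))
    }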

Or do you want to take the top 10 over the whole five-minute window?

Cheers,



On Thu, May 29, 2014 at 2:04 PM, nilmish nilmish@gmail.com wrote:

 I have a DStream whose RDDs are batched every 2 seconds. I have sorted each
 RDD and want to retain only the top 10 values and discard the rest. How can
 I retain only the top 10 values?

 I am trying to get the top 10 hashtags. Instead of sorting the entire
 5-minute counts (thereby incurring the cost of a data shuffle), I am trying
 to get the top 10 hashtags in each partition. I am stuck at how to retain
 the top 10 hashtags in each partition.






Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Brian Gawalt
Try looking at the .mapPartitions() method implemented for RDD[T] objects. It
gives you direct access to an iterator over the elements of each partition,
for doing the kind of within-partition hashtag counting you're describing.
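
A hedged sketch of how that might look (assuming tags is an RDD[String] of
raw hashtags from one batch; the names are illustrative, not from this
thread):

    // Sketch only: count hashtags within each partition and keep only that
    // partition's local top 10. The per-partition results can then be merged
    // into a global top 10, as discussed elsewhere in the thread.
    val partitionTop10 = tags.mapPartitions { iter =>
      val counts = scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)
      iter.foreach(tag => counts(tag) += 1)
      counts.toList.sortBy(-_._2).take(10).iterator
    }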


