Hi
I would like to know if it is possible to sort within DStream. I know it's 
possible to sort within an RDD and I know it's impossible to sort within the 
entire DStream but I would be satisfied with sorting across 2 RDDs:

1) merge 2 consecutive RDDs
2) reduce by key + sort the merged data
3) take the first half of the data in the sorted RDD
4) merge the second half of the RDD with the next consecutive RDD
5) jump to step 2) and repeat

Here's an attempt to perform the steps 1-5. I window the stream to WindowSize 
20 then I take 10 seconds of these 20 (probably the 10 seconds don't start 
exactly at the beginning of the 20sec window but ignore that for this exercise) 
by using another window of size 10. The first 10 seconds is what I want but I 
want to skip every other 10 sec window.

Tuple format: (timestamp, count)

val bigWindow= Mystream.window(20,10)
bigWindow.reduceByKey( (t1,t2) => (t1._2 + t2._2) )
     .transform(rdd=>rdd.sortByKey(true))
     .window(10,10)    //make 10sec window out of the 20 sec window
     .filterBy(t =>  (t._1 <= bigWindow.midWindowValue._1 )  )     //take value 
from bigWindow (the 20 sec window) from time index 0 to 20/2 denoted in my 
filter by bigWindow.midWindowValue

a) Any ideas how to rewrite this query or how to get the element in the middle 
of a time window (or some arbitrary location)?
b) If you know how I can iterate and merge + split RDDs I'd like to give that a 
try as well instead of using 2 time windows.
c) Suggestions how to do the overall DStream reducing and sorting


-Adrian

Reply via email to