Hi all,

I am using Spark Streaming for my use case.
I want to
        - partition or group the stream by key
        - window the tuples within each partition
        - find the max/min element in each window (per partition)

My code looks like this:

         val keyedStream = socketDataSource.map(s => (s.key, s.value))   // define key and value

         val aggregatedStream = keyedStream
           .groupByKeyAndWindow(Milliseconds(8000), Milliseconds(4000))  // partition the stream by key and window it
           .map(window => minMaxTuples(window))                          // reduce each window to its max/min element



In the minMaxTuples function I use window._2.toArray with maxBy/minBy to
find the max/min element.
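
For reference, here is roughly what minMaxTuples does (a minimal sketch; I
am assuming for illustration that the value is a (timestamp, measurement)
pair and that min/max are taken over the measurement):

         // Sketch of minMaxTuples: for one key's window, keep the tuples with
         // the smallest and largest measurement.
         // Assumes values are (Long, Double) = (timestamp, measurement).
         def minMaxTuples(window: (String, Iterable[(Long, Double)]))
             : (String, ((Long, Double), (Long, Double))) = {
           val tuples = window._2.toArray
           (window._1, (tuples.minBy(_._2), tuples.maxBy(_._2)))
         }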

Maybe this is not the right way to do it (if so, please correct me), but
what I realized inside the minMaxTuples function is that previously
computed results are not being reused.

So if in minibatch n we have a keyed window {key-1, [a,b,c,d,e]} and we
iterate over all of its elements to find the result, in the next minibatch
(n+1) we may have {key-1, [c,d,e,f,g]}, in which [c,d,e] overlap with the
previous window and are recomputed from scratch.
Especially for large windows, I think this can be a significant performance
issue.

Any solution for this?
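
For what it's worth, one alternative I have been considering is to avoid
materializing the grouped window and instead fold each window down to a
(min, max) pair per key with reduceByKeyAndWindow. As far as I understand,
Spark reduces each batch per key first and then combines those partial
results across the window, so overlapping batches would at least not be
re-scanned tuple by tuple. A minimal sketch, under the same (Long, Double)
value assumption as above (note the inverse-function variant of
reduceByKeyAndWindow does not apply here, since min/max are not
invertible):

         // Sketch: reduce each window to a (minTuple, maxTuple) pair per key
         // instead of grouping all tuples. Per-batch partial reductions are
         // combined across the window, so overlapping batches are reused at
         // the partial-result level.
         val minMaxStream = keyedStream
           .mapValues(t => (t, t))                       // seed (min, max) with the tuple itself
           .reduceByKeyAndWindow(
             (a, b) => (
               if (a._1._2 <= b._1._2) a._1 else b._1,   // keep tuple with the smaller measurement
               if (a._2._2 >= b._2._2) a._2 else b._2),  // keep tuple with the larger measurement
             Milliseconds(8000), Milliseconds(4000))

Would that be the recommended approach here, or is there a better way?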


Thanks
Adrienne
