Hi all,

I am using Spark Streaming for my use case. I want to:

- partition or group the stream by key,
- window the tuples in each partition, and
- find the max/min element in each window (in every partition).
My code looks like this:

val keyedStream = socketDataSource.map(s => (s.key, s.value)) // define key and value

val aggregatedStream = keyedStream
  .groupByKeyAndWindow(Milliseconds(8000), Milliseconds(4000)) // group the stream by key and window it
  .map(window => minMaxTuples(window)) // reduce each window to its max/min element

In the minMaxTuples function I use window._2.toArray with maxBy/minBy to find the max/min element. Maybe this is not the right way to do it; if so, please correct me. What I realize inside minMaxTuples, though, is that previously computed results are not reused. In minibatch n we may have a keyed window {key-1, [a,b,c,d,e]} and iterate over all of [a,b,c,d,e] to find the result; in the next minibatch (n+1) we may have {key-1, [c,d,e,f,g]}, where [c,d,e] overlaps with the previous window. Especially for large windows, I think this can become a significant performance issue.

Is there any solution for this?

Thanks,
Adrienne
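P.S. For concreteness, here is roughly what my minMaxTuples does (a simplified sketch; the String key and Double value types are just placeholders for this example):

```scala
// Sketch of minMaxTuples: reduces one keyed window (key, values)
// to (key, (min, max)) by scanning all values in the window.
// Note: every scan is over the full window, so overlapping elements
// between consecutive minibatches are re-processed each time.
def minMaxTuples(window: (String, Iterable[Double])): (String, (Double, Double)) = {
  val values = window._2.toArray        // materialize the window's values
  val min = values.minBy(identity)      // full scan for the minimum
  val max = values.maxBy(identity)      // full scan for the maximum
  (window._1, (min, max))
}
```

For example, minMaxTuples(("key-1", Seq(3.0, 1.0, 5.0))) returns ("key-1", (1.0, 5.0)).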