You can use Spark Streaming's updateStateByKey to do arbitrary sessionization. See the example - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala All it does is store a single number (the count of each word seen since the beginning), but you can extend it to store arbitrary state data. And then you can use that state to keep track of gaps, windows, etc. People have done a lot of sessionization using this. I am sure others can chime in.
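To make that concrete, here is a minimal sketch of the per-key update function you would pass to updateStateByKey, extended to track a session instead of a plain count. The names (SessionState, updateSession, gapMs) are illustrative, not Spark API; the function only has the (Seq[V], Option[S]) => Option[S] shape updateStateByKey expects. For simplicity it assumes all events within one batch belong to the same session (it does not split a single batch on the gap).

```scala
// Illustrative session state for one key; not part of the Spark API.
case class SessionState(start: Long, lastSeen: Long, count: Long)

val gapMs = 30 * 60 * 1000L // minimum gap duration: 30 minutes

// newEvents: event timestamps for this key in the current batch.
// Signature matches what updateStateByKey expects per key:
// (Seq[V], Option[S]) => Option[S]
def updateSession(newEvents: Seq[Long],
                  state: Option[SessionState]): Option[SessionState] = {
  if (newEvents.isEmpty) state
  else {
    val sorted = newEvents.sorted
    val merged = state match {
      case Some(s) if sorted.head - s.lastSeen < gapMs =>
        // Events arrived within the gap: extend the existing session.
        s.copy(lastSeen = sorted.last, count = s.count + sorted.size)
      case _ =>
        // Gap exceeded (or no prior state): start a new session.
        SessionState(sorted.head, sorted.last, sorted.size)
    }
    Some(merged)
  }
}
```

In a real job you would enable checkpointing (ssc.checkpoint(...)) and wire it up as something like events.updateStateByKey(updateSession); emitting a "closed" session when the gap expires would need an extra flag or timeout in the state.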
On Fri, Aug 7, 2015 at 10:48 AM, Ankur Chauhan <an...@malloc64.com> wrote:
> Hi all,
>
> I am trying to figure out how to perform the equivalent of "Session windows"
> (as mentioned in https://cloud.google.com/dataflow/model/windowing) using
> Spark Streaming. Is it even possible (i.e. possible to do efficiently at
> scale)? Just to expand on the definition, taken from the Google Dataflow
> documentation:
>
> The simplest kind of session windowing specifies a minimum gap duration.
> All data arriving below a minimum threshold of time delay is grouped into
> the same window. If data arrives after the minimum specified gap duration
> time, this initiates the start of a new window.
>
> Any help would be appreciated.
>
> -- Ankur Chauhan