This is a good use case for DStream.updateStateByKey! It allows you to maintain arbitrary per-key state. Check out this example: https://github.com/tdas/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/StatefulNetworkWordCount.scala
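In condensed form, the pattern in that example looks roughly like this (a sketch only; the app name, batch interval, checkpoint path, and socket source are placeholders):

// Condensed sketch of the updateStateByKey pattern from the linked example.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StatefulSketch")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint(".")  // updateStateByKey requires a checkpoint directory

// newValues holds this batch's values for a key; runningCount is the prior state
val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) => {
  Some(runningCount.getOrElse(0) + newValues.sum)
}

val wordCounts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .updateStateByKey[Int](updateFunc)

wordCounts.print()
ssc.start()
ssc.awaitTermination()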
Also take a look at the documentation for more information. The linked example just keeps an integer (the running count) as the state, but in practice the state can be anything. In your case, you can keep that per-group (i.e. per-key) threshold as the state. In the update function, for each record of a group, you can check the timestamp against the threshold, push results to the DB, and maybe update the threshold.

TD

On Thu, Apr 17, 2014 at 1:24 PM, xargsgrep <ahs...@gmail.com> wrote:
> Hi, I'm completely new to Spark Streaming (and Spark) and have been reading
> up on it and trying out various examples the past few days. I have a
> particular use case which I think it would work well for, but I wanted to
> put it out there and get some feedback on whether or not it actually would.
> The use case is:
>
> We have web tracking data continuously coming in from a pool of web
> servers. For simplicity, let's just say the data is text lines with a
> known set of fields, e.g. "timestamp userId domain ...". What I want to
> do is:
> 1. group this continuous stream of data by "userId:domain", and
> 2. when the latest timestamp in each group is older than a certain
> threshold, persist the results to a DB
>
> #1 is straightforward and there are plenty of examples showing how to do
> it. However, I'm not sure how I would go about doing #2, or if that's
> something I can even do with Spark, because as far as I can tell it
> operates on sliding windows. I really just want to continue to accumulate
> these groups of "userId:domain" for all time (without specifying a window)
> and then roll them up and flush them once no new data has come in for a
> group after a certain amount of time. Would the updateStateByKey function
> allow me to do this somehow?
>
> Any help would be appreciated.
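For the timeout-based flush in #2, one possible shape of the state and update function described above (a hypothetical sketch: Event, GroupState, flushToDb, timeoutMs, and the pairs stream are placeholder names, not Spark API):

// Hypothetical sketch of a per-group inactivity timeout with updateStateByKey.
case class Event(timestamp: Long, line: String)
case class GroupState(events: Seq[Event], lastSeen: Long, closed: Boolean)

val timeoutMs = 60 * 1000L  // flush a group after 60s without new data

def flushToDb(key: String, events: Seq[Event]): Unit = { /* your DB write */ }

// Called each batch for every key that has new data or existing state.
// Returning None drops the key from the state.
def updateGroup(newEvents: Seq[Event], state: Option[GroupState]): Option[GroupState] = {
  val now = System.currentTimeMillis()  // processing time; one could track the max event timestamp instead
  state match {
    case Some(s) if s.closed => None  // flushed in the previous batch; drop the key
    case _ =>
      val events = state.map(_.events).getOrElse(Seq.empty) ++ newEvents
      val lastSeen = if (newEvents.nonEmpty) now else state.map(_.lastSeen).getOrElse(now)
      val closed = newEvents.isEmpty && (now - lastSeen > timeoutMs)
      Some(GroupState(events, lastSeen, closed))
  }
}

// pairs: DStream[(String, Event)] keyed by "userId:domain"
val stateStream = pairs.updateStateByKey[GroupState](updateGroup)

// Persist groups the update function just marked closed; the update
// function removes them from the state on the next batch.
stateStream.filter { case (_, s) => s.closed }
  .foreachRDD(_.foreach { case (key, s) => flushToDb(key, s.events) })

Marking a group closed and flushing it downstream, rather than writing to the DB inside the update function, keeps the side effect in foreachRDD, where Spark expects output operations.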