Thanks, I played around with that example and had some follow-up questions.

1. The only way I was able to accumulate data per key was to actually store all of the data in the state, not just the timestamp (see the example below). Otherwise I don't have access to data older than the batchDuration of the StreamingContext. This raises concerns about memory usage, what happens if a node crashes, and how partitioning and replication work across Spark nodes. Does Spark ever store the state on disk?

Again, what I want to do is aggregate these sets of text lines by key, and once a key has been "inactive" (i.e., no new data has been received for that key) for a certain amount of time (e.g. 30 minutes), finally save them somewhere and remove them from the state.
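Roughly, what I mean is a sketch like the one below. The socket source, port, key format ("user:site") and checkpoint path are just placeholders, and the "flush by returning None" part is an assumption about how expiry would work, not something I've confirmed:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// State kept per key: every line seen so far plus the time we last saw new data.
case class SessionState(lines: Seq[String], lastSeen: Long)

val inactivityThresholdMs = 30 * 60 * 1000L  // 30 minutes

// Called once per batch for a key; newLines holds that key's lines from the
// current batch (empty if nothing arrived for the key this batch).
def updateSession(newLines: Seq[String], state: Option[SessionState]): Option[SessionState] = {
  val now = System.currentTimeMillis()
  state match {
    case Some(s) if newLines.nonEmpty =>
      Some(SessionState(s.lines ++ newLines, now))   // append and refresh the timestamp
    case Some(s) if now - s.lastSeen > inactivityThresholdMs =>
      // Inactive too long: returning None drops the key from the state.
      // (The accumulated lines would have to be saved somewhere before this point.)
      None
    case Some(s) =>
      Some(s)                                        // idle, but not yet expired
    case None =>
      Some(SessionState(newLines, now))              // first time we see this key
  }
}

val conf = new SparkConf().setAppName("session-aggregation")
val ssc = new StreamingContext(conf, Seconds(600))   // 10-minute batches
ssc.checkpoint("/tmp/session-checkpoint")            // updateStateByKey requires a checkpoint dir

// Placeholder source; real input would be lines like "09:00:00 user-one foo.com".
val lines = ssc.socketTextStream("localhost", 9999)
val keyed = lines.map { line =>
  val Array(_, user, site) = line.split(" ")
  (s"$user:$site", line)                             // key = "user:site", value = the raw line
}

val sessions = keyed.updateStateByKey(updateSession _)
sessions.print()

ssc.start()
ssc.awaitTermination()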
2. Is the update function called only for keys that appear in the current batchDuration's stream, or for all keys in the state? If it's the former, how can I check the timestamp of keys from an older batch that never appear in the stream again?

Example: batchDuration = 10 minutes, timestampThreshold = 30 minutes

Data:
  09:00:00 user-one foo.com
  09:09:00 user-one foo.com
  09:15:00 user-two bar.com
  09:18:00 user-one foo.com
  09:25:00 user-two bar.com

So given this set of data there are 3 batches, and the state would look something like:

batch1:
  { "user-one:foo.com" : ("09:00:00 user-one foo.com", "09:09:00 user-one foo.com") }

batch2:
  { "user-one:foo.com" : ("09:00:00 user-one foo.com", "09:09:00 user-one foo.com", "09:18:00 user-one foo.com"),
    "user-two:bar.com" : ("09:15:00 user-two bar.com") }

batch3:
  { "user-one:foo.com" : ("09:00:00 user-one foo.com", "09:09:00 user-one foo.com", "09:18:00 user-one foo.com"),
    "user-two:bar.com" : ("09:15:00 user-two bar.com", "09:25:00 user-two bar.com") }

Now let's assume no more data comes in for either of the two keys above. Since the threshold is 30 minutes from the latest timestamp, "user-one:foo.com" should be flushed after 09:48 and "user-two:bar.com" should be flushed after 09:55.
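One way I could check this empirically is with a probe like the one below, reusing SessionState, inactivityThresholdMs and the keyed DStream from the sketch in question 1. This is only an assumption about how to observe the behavior, not a statement of what Spark actually does:

// If Spark calls the update function for keys with no new data in the batch
// (newValues empty but prior state present), the logging branch fires and stale
// keys can be inspected and expired on every batch; if not, it never executes.
def probeUpdate(newValues: Seq[String], state: Option[SessionState]): Option[SessionState] = {
  if (newValues.isEmpty && state.isDefined) {
    // Note: this prints to the executor's stdout, not the driver console.
    println(s"Revisited idle key, lastSeen=${state.get.lastSeen}")
  }
  val now = System.currentTimeMillis()
  state match {
    case Some(s) if newValues.isEmpty && now - s.lastSeen > inactivityThresholdMs =>
      None                                          // e.g. "user-one:foo.com" after 09:48
    case Some(s) if newValues.isEmpty =>
      Some(s)
    case other =>
      Some(SessionState(other.map(_.lines).getOrElse(Seq.empty) ++ newValues, now))
  }
}

keyed.updateStateByKey(probeUpdate _).print()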