Thank you Tathagata, very helpful answer. I would like to highlight, though, that recent stream processing systems are trying to help users implement use cases that hold such large state (like 2 months of data). I would mention here Samza state management <http://samza.apache.org/learn/documentation/0.9/container/state-management.html> and Trident state management <https://storm.apache.org/documentation/Trident-state>. I'm waiting for Spark to help with that too, because I generally prefer this technology :)
But considering holding state in Cassandra with Spark Streaming, I understand we're not talking here about using Cassandra as input or output (nor about using the spark-cassandra-connector <https://github.com/datastax/spark-cassandra-connector>). We're talking about querying Cassandra from map/mapPartitions functions. I have one question about it: Is it possible to query Cassandra asynchronously within Spark Streaming? And while doing so, is it possible to take the next batch of rows while the previous one is waiting on Cassandra I/O? I think (but I'm not sure) this generally asks whether several consecutive windows can interleave (because they take long to process). Let's draw it:

window1 <------|query Cassandra asynchronously--->
window2        <--------------------------------->

While writing this, I start to believe they can, because windows are time-triggered, not triggered when the previous window has finished... But it's better to ask :)

2015-04-15 2:08 GMT+02:00 Tathagata Das <t...@databricks.com>:

> Fundamentally, stream processing systems are designed for processing
> streams of data, not for storing large volumes of data for a long period
> of time. So if you have to maintain that much state for months, then it's
> best to use another system that is designed for long-term storage (like
> Cassandra), which has proper support for making all that state
> fault-tolerant, performant, etc. So yes, the best option is to use
> Cassandra for the state, with Spark Streaming jobs accessing the state
> from Cassandra. There are a number of optimizations that can be done. It's
> not too hard to build a simple on-demand populated cache (a singleton hash
> map, for example) that speeds up access from Cassandra, with all updates
> written through the cache. This is a common use of Spark Streaming +
> Cassandra/HBase.
>
> Regarding the performance of updateStateByKey, we are aware of the
> limitations, and we will improve it soon :)
>
> TD
>
> On Tue, Apr 14, 2015 at 12:34 PM, Krzysztof Zarzycki <k.zarzy...@gmail.com>
> wrote:
>
>> Hey guys, could you please help me with a question I asked on
>> Stackoverflow:
>> https://stackoverflow.com/questions/29635681/is-it-feasible-to-keep-millions-of-keys-in-state-of-spark-streaming-job-for-two
>> ? I'll be really grateful for your help!
>>
>> I'm also pasting the question below:
>>
>> I'm trying to solve a (simplified here) problem in Spark Streaming: Let's
>> say I have a log of events made by users, where each event is a tuple
>> (user name, activity, time), e.g.:
>>
>> ("user1", "view", "2015-04-14T21:04Z")
>> ("user1", "click", "2015-04-14T21:05Z")
>>
>> Now I would like to gather events by user to do some analysis of them.
>> Let's say that the output is some analysis of:
>>
>> ("user1", List(("view", "2015-04-14T21:04Z"), ("click", "2015-04-14T21:05Z")))
>>
>> The events should be kept for as long as *2 months*. During that time
>> there might be around *500 million* such events, and *millions of unique*
>> users, which are the keys here.
>>
>> *My questions are:*
>>
>> - Is it feasible to do such a thing with updateStateByKey on a DStream
>>   when I have millions of keys stored?
>> - Am I right that DStream.window is of no use here, when I have a
>>   2-month window and would like to have a slide of a few seconds?
>>
>> P.S. I found out that updateStateByKey is called on all the keys on
>> every slide, so that means it will be called millions of times every few
>> seconds. That makes me doubt this design, and I'm rather thinking about
>> alternative solutions like:
>>
>> - using Cassandra for state
>> - using Trident state (with Cassandra, probably)
>> - using Samza with its state management
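P.S. To make the asynchronous-querying idea concrete, here is a minimal sketch of the per-partition pattern I have in mind, in plain Java. `lookupStateAsync` is a hypothetical stand-in for a real async Cassandra driver call (the real driver also returns a future from its async execute); inside a `mapPartitions` function you would replace it accordingly. The point is that all lookups for a partition are issued before any result is awaited, so the I/O for consecutive rows overlaps:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

public class AsyncStateLookup {

    // Hypothetical stand-in for an async Cassandra query.
    static CompletableFuture<String> lookupStateAsync(String key) {
        return CompletableFuture.supplyAsync(() -> "state-for-" + key);
    }

    // Usable as the body of a mapPartitions function: fire every lookup
    // first, then join, so the partition's I/O requests run concurrently.
    static List<Map.Entry<String, String>> enrichPartition(Iterator<String> keys) {
        List<CompletableFuture<Map.Entry<String, String>>> pending = new ArrayList<>();
        keys.forEachRemaining(k ->
                pending.add(lookupStateAsync(k)
                        .thenApply(state -> new SimpleEntry<>(k, state))));
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (CompletableFuture<Map.Entry<String, String>> f : pending) {
            out.add(f.join()); // blocks only after all requests are in flight
        }
        return out;
    }
}
```

This only overlaps I/O for rows within one batch; whether whole windows interleave is the separate scheduling question above.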
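The "on-demand populated cache (singleton hash map)" with write-through updates that Tathagata describes could be sketched roughly like this. It's only a sketch: `KeyValueStore` is a hypothetical interface standing in for the Cassandra-backed store, and one such singleton would live per executor JVM:

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class StateCache {

    // Hypothetical stand-in for the Cassandra-backed store.
    public interface KeyValueStore {
        Optional<String> read(String key);
        void write(String key, String value);
    }

    private static volatile StateCache instance;
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    private final KeyValueStore store;

    private StateCache(KeyValueStore store) { this.store = store; }

    // One cache per JVM (i.e. per executor), created lazily.
    public static StateCache getInstance(KeyValueStore store) {
        if (instance == null) {
            synchronized (StateCache.class) {
                if (instance == null) instance = new StateCache(store);
            }
        }
        return instance;
    }

    // On-demand population: consult the cache first, fall back to the store.
    public Optional<String> get(String key) {
        String cached = cache.get(key);
        if (cached != null) return Optional.of(cached);
        Optional<String> fromStore = store.read(key);
        fromStore.ifPresent(v -> cache.put(key, v));
        return fromStore;
    }

    // Write-through: every update goes to the store and then the cache.
    public void put(String key, String value) {
        store.write(key, value);
        cache.put(key, value);
    }
}
```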
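The scaling concern in the quoted P.S. is exactly that updateStateByKey's update function runs over every key in the state on every batch, not only the keys that received new events. A small pure-Java simulation of those semantics (hypothetical names, not Spark's actual implementation) makes the cost visible:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.BiFunction;

public class UpdateStateByKeySim {

    // Simulates one batch of updateStateByKey: the update function is
    // invoked once per key in (existing state ∪ new batch), so with
    // millions of stored keys it runs millions of times per batch even
    // when the batch itself is tiny.
    static <V, S> Map<String, S> step(
            Map<String, S> state,
            Map<String, List<V>> batch,
            BiFunction<List<V>, Optional<S>, Optional<S>> update) {
        Set<String> keys = new HashSet<>(state.keySet());
        keys.addAll(batch.keySet());
        Map<String, S> next = new HashMap<>();
        for (String k : keys) {
            List<V> newValues = batch.getOrDefault(k, List.of());
            update.apply(newValues, Optional.ofNullable(state.get(k)))
                  .ifPresent(s -> next.put(k, s));
        }
        return next;
    }
}
```

With two users in state and new events for only one of them, the update function still fires for both, which is why per-key cost dominates once the key set reaches the millions.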