Hi TD, regarding the performance of updateStateByKey, do you have a JIRA for that so we can watch it? Thank you!
________________________________
From: Tathagata Das [mailto:t...@databricks.com]
Sent: Wednesday, April 15, 2015 8:09 AM
To: Krzysztof Zarzycki
Cc: user
Subject: Re: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

Fundamentally, stream processing systems are designed for processing streams of data, not for storing large volumes of data for a long period of time. So if you have to maintain that much state for months, it's best to use another system that is designed for long-term storage (like Cassandra), which has proper support for making all that state fault-tolerant, performant, etc. So yes, the best option is to use Cassandra for the state, with the Spark Streaming jobs accessing that state from Cassandra.

There are a number of optimizations that can be done. It's not too hard to build a simple on-demand populated cache (a singleton hash map, for example) that speeds up access from Cassandra, with all updates written through the cache (a sketch of such a cache appears at the end of this thread). This is a common use of Spark Streaming + Cassandra/HBase.

Regarding the performance of updateStateByKey, we are aware of the limitations, and we will improve it soon :)

TD

On Tue, Apr 14, 2015 at 12:34 PM, Krzysztof Zarzycki <k.zarzy...@gmail.com> wrote:

Hey guys, could you please help me with a question I asked on Stack Overflow:
https://stackoverflow.com/questions/29635681/is-it-feasible-to-keep-millions-of-keys-in-state-of-spark-streaming-job-for-two ?
I'll be really grateful for your help! I'm also pasting the question below:

I'm trying to solve a (simplified here) problem in Spark Streaming: Let's say I have a log of events made by users, where each event is a tuple (user name, activity, time), e.g.:

("user1", "view", "2015-04-14T21:04Z")
("user1", "click", "2015-04-14T21:05Z")

Now I would like to gather events by user to do some analysis of them. Let's say the output is some analysis of:

("user1", List(("view", "2015-04-14T21:04Z"), ("click", "2015-04-14T21:05Z")))

The events should be kept for as long as 2 months. During that time there might be around 500 million such events, and millions of unique users, which are the keys here. My questions are:

* Is it feasible to do such a thing with updateStateByKey on a DStream, when I have millions of keys stored? (A sketch of this approach appears at the end of this thread.)
* Am I right that DStream.window is of no use here, when the window is 2 months long and I would like a slide of a few seconds?

P.S. I found out that updateStateByKey is called on all the keys on every slide, so it will be called millions of times every few seconds. That makes me doubt this design, and I'm rather thinking about alternative solutions like:

* using Cassandra for state
* using Trident state (with Cassandra, probably)
* using Samza with its state management.
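For reference, a minimal sketch of the updateStateByKey design described in the question, in Scala against the Spark Streaming API. The socket source, the comma-separated record layout, and all names (UserEventState, updateEvents, the checkpoint path) are illustrative assumptions, not from the original posts:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object UserEventState {
      type Event = (String, String) // (activity, time)

      // Called on every batch for every key ever seen; appends the batch's new
      // events to the accumulated list. This per-batch scan over all keys is
      // exactly the cost concern raised in the P.S. above.
      def updateEvents(newEvents: Seq[Event],
                       state: Option[List[Event]]): Option[List[Event]] =
        Some(state.getOrElse(Nil) ++ newEvents)

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("user-event-state"), Seconds(5))
        ssc.checkpoint("/tmp/user-event-checkpoints") // required by updateStateByKey

        // Parse "user,activity,time" lines into (user, (activity, time)) pairs.
        val events = ssc.socketTextStream("localhost", 9999)
          .map(_.split(","))
          .map(f => (f(0), (f(1), f(2))))

        // DStream[(user, List[Event])]: the full per-user history as state.
        val histories = events.updateStateByKey(updateEvents _)
        histories.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

With millions of keys, the state RDD and the checkpoints grow with the full two months of history, which is what motivates the external-store alternatives below.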
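And a rough sketch of the on-demand populated, write-through cache TD suggests, in Scala with the DataStax Java driver. The keyspace, the user_state table, and serializing state as a plain string are assumptions for illustration; since a Scala object is a JVM singleton, each executor gets one cache and one Cassandra session:

    import java.util.concurrent.ConcurrentHashMap
    import com.datastax.driver.core.{Cluster, PreparedStatement, Session}

    // Hypothetical schema:
    //   CREATE TABLE user_state (user text PRIMARY KEY, events text);
    object StateCache {
      private lazy val session: Session =
        Cluster.builder().addContactPoint("cassandra-host").build().connect("my_ks")

      private lazy val selectStmt: PreparedStatement =
        session.prepare("SELECT events FROM user_state WHERE user = ?")
      private lazy val insertStmt: PreparedStatement =
        session.prepare("INSERT INTO user_state (user, events) VALUES (?, ?)")

      private val cache = new ConcurrentHashMap[String, String]()

      // On-demand population: consult the map first, fall back to Cassandra
      // on a miss and remember the result.
      def get(user: String): Option[String] =
        Option(cache.get(user)).orElse {
          Option(session.execute(selectStmt.bind(user)).one()).map { row =>
            val events = row.getString("events")
            cache.put(user, events)
            events
          }
        }

      // Write-through: every update goes to both the map and Cassandra, so
      // Cassandra holds the durable, fault-tolerant copy of the state.
      def put(user: String, events: String): Unit = {
        cache.put(user, events)
        session.execute(insertStmt.bind(user, events))
      }
    }

A job would then touch state from within partitions, e.g. stream.foreachRDD(_.foreachPartition(_.foreach { case (user, event) => StateCache.put(user, StateCache.get(user).getOrElse("") + event) })), so only cache misses and writes pay the Cassandra round trip.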