Hi TD, regarding the performance of updateStateByKey, do you have a
JIRA for that so we can watch it? Thank you!

 

________________________________

From: Tathagata Das [mailto:t...@databricks.com] 
Sent: Wednesday, April 15, 2015 8:09 AM
To: Krzysztof Zarzycki
Cc: user
Subject: Re: Is it feasible to keep millions of keys in state of Spark
Streaming job for two months?

 

Fundamentally, stream processing systems are designed for processing
streams of data, not for storing large volumes of data for a long period
of time. So if you have to maintain that much state for months, it's
best to use another system that is designed for long-term storage (like
Cassandra), which has proper support for making all that state
fault-tolerant, highly performant, etc. So yes, the best option is to use
Cassandra for the state, with the Spark Streaming jobs accessing the state
from Cassandra. There are a number of optimizations that can be done.
It's not too hard to build a simple on-demand populated cache (a singleton
hash map, for example) that speeds up access from Cassandra, with all
updates written through the cache. This is a common use of Spark
Streaming + Cassandra/HBase. 
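
A minimal sketch of such an on-demand populated, write-through cache in
Scala could look like the following. The readRow/writeRow helpers are
hypothetical placeholders for whatever Cassandra/HBase client calls you
actually use; the cache itself is just a singleton map per executor JVM.

import java.util.concurrent.ConcurrentHashMap

// Singleton, per-executor-JVM cache: populated on demand from the external
// store, with all updates written through to it. readRow/writeRow are
// hypothetical placeholders for the real Cassandra/HBase client calls.
object UserStateCache {
  private val cache = new ConcurrentHashMap[String, List[(String, String)]]()

  private def readRow(user: String): List[(String, String)] = ???          // load from Cassandra
  private def writeRow(user: String, events: List[(String, String)]): Unit = ???  // persist to Cassandra

  // On-demand population: only keys actually seen in a batch get loaded.
  def get(user: String): List[(String, String)] = {
    val cached = cache.get(user)
    if (cached != null) cached
    else {
      val loaded = readRow(user)
      cache.putIfAbsent(user, loaded)
      loaded
    }
  }

  // Write-through: keep the cache and the external store in sync.
  def put(user: String, events: List[(String, String)]): Unit = {
    cache.put(user, events)
    writeRow(user, events)
  }
}

In the streaming job this would typically be called from inside
foreachRDD { rdd => rdd.foreachPartition { ... } }, so the cache lives on
the executors rather than on the driver.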

 

Regarding the performance of updateStateByKey, we are aware of the
limitations, and we will improve it soon :)

 

TD

 

 

On Tue, Apr 14, 2015 at 12:34 PM, Krzysztof Zarzycki
<k.zarzy...@gmail.com> wrote:

Hey guys, could you please help me with a question I asked on
Stackoverflow:
https://stackoverflow.com/questions/29635681/is-it-feasible-to-keep-millions-of-keys-in-state-of-spark-streaming-job-for-two ? I'll be really
grateful for your help!

I'm also pasting the question below:

I'm trying to solve a problem (simplified here) in Spark Streaming.
Let's say I have a log of events made by users, where each event is a
tuple (user name, activity, time), e.g.:

("user1", "view", "2015-04-14T21:04Z") ("user1", "click",
"2015-04-14T21:05Z")

Now I would like to gather the events by user to do some analysis of
them. Let's say the output is some analysis over:

("user1", List(("view", "2015-04-14T21:04Z"), ("click",
"2015-04-14T21:05Z")))

The events should be kept for up to 2 months. During that time there
might be around 500 million such events, and millions of unique users,
which are the keys here.

My questions are:

* Is it feasible to do such a thing with updateStateByKey on a DStream
when I have millions of keys stored? (See the sketch after these
questions.)

* Am I right that DStream.window is of no use here, when I have a
2-month-long window and would like a slide of a few seconds?
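
To make the first question concrete, here is a minimal sketch of the
updateStateByKey approach, assuming the input has already been parsed
into a DStream of (user, (activity, time)) tuples; the source, batch
interval and checkpoint path are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val conf = new SparkConf().setAppName("user-event-state")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs:///tmp/checkpoints")  // required for stateful transformations

// Placeholder: in practice this would come from Kafka, Flume, etc.
val events: DStream[(String, (String, String))] = ???

// State per user = full list of (activity, time) pairs seen so far.
// Note: the update function is invoked for every key in the state on every
// batch, which is exactly the concern raised in the P.S. below.
val userHistories: DStream[(String, List[(String, String)])] =
  events.updateStateByKey[List[(String, String)]] {
    (newEvents: Seq[(String, String)], state: Option[List[(String, String)]]) =>
      Some(state.getOrElse(Nil) ++ newEvents)
  }

userHistories.print()
ssc.start()
ssc.awaitTermination()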

P.S. I found out that updateStateByKey is called on all the keys on
every slide, which means it will be called millions of times every few
seconds. That makes me doubt this design, and I'm rather thinking
about alternative solutions like:

* using Cassandra for state

* using Trident state (with Cassandra probably)

* using Samza with its state management.

 
