Hey guys, could you please help me with a question I asked on Stack Overflow?
https://stackoverflow.com/questions/29635681/is-it-feasible-to-keep-millions-of-keys-in-state-of-spark-streaming-job-for-two
I'd be really grateful for your help!

I'm also pasting the question below:

I'm trying to solve a problem (simplified here) in Spark Streaming. Let's
say I have a log of events made by users, where each event is a tuple
(user name, activity, time), e.g.:

("user1", "view", "2015-04-14T21:04Z") ("user1", "click",
"2015-04-14T21:05Z")

Now I would like to group the events by user in order to do some analysis
on them. Let's say the output is some analysis of:

("user1", List(("view", "2015-04-14T21:04Z"),("click", "2015-04-14T21:05Z"))

The events should be kept for as long as *2 months*. During that time there
might be around *500 million* such events and *millions of unique* users,
which are the keys here.
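
For reference, this is roughly the shape of the job I have in mind with
updateStateByKey. It is just a minimal sketch: the socket source, the
5-second batch interval and the checkpoint path are placeholders, not my
real setup.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object UserEventState {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("UserEventState")
        val ssc  = new StreamingContext(conf, Seconds(5))
        // updateStateByKey requires a checkpoint directory
        ssc.checkpoint("hdfs:///tmp/checkpoints")

        // Placeholder input: lines like "user1,view,2015-04-14T21:04Z"
        val lines  = ssc.socketTextStream("localhost", 9999)
        val events = lines.map(_.split(","))
                          .map(f => (f(0), (f(1), f(2)))) // (user, (activity, time))

        // Append each batch's new events to the per-user state. Note that
        // this update function is invoked for every key in the state on
        // every batch, whether or not that key received new events.
        val userEvents = events.updateStateByKey[List[(String, String)]] {
          (newEvents: Seq[(String, String)], state: Option[List[(String, String)]]) =>
            Some(state.getOrElse(Nil) ++ newEvents)
        }

        userEvents.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

My concern is not with the code itself but with how that state behaves once
millions of keys have accumulated over two months.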

*My questions are:*

   - Is it feasible to do such a thing with updateStateByKey on a DStream
   when I have millions of keys stored?
   - Am I right that DStream.window is of no use here, given a 2-month
   window length and a slide of a few seconds (see the sketch after this
   list)?
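
Just to make the second question concrete, this is the kind of call I mean
(illustration only, reusing the events stream from the sketch above; the
exact durations are not the point):

    import org.apache.spark.streaming.{Minutes, Seconds}

    // A sliding window covering roughly 2 months (60 days expressed in
    // minutes), recomputed every 5 seconds.
    val windowed = events.window(Minutes(60L * 24 * 60), Seconds(5))

As far as I understand, that would mean Spark keeping two months' worth of
batches around just to slide by a few seconds.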

P.S. I found out that updateStateByKey is called on all the keys on every
batch interval, which means it will be called millions of times every few
seconds. That makes me doubt this design, and I'm rather thinking about
alternative solutions like:

   - using Cassandra for state
   - using Trident state (with Cassandra probably)
   - using Samza with its state management.
