Thank you Tathagata, very helpful answer. I would like to highlight, though, that recent stream processing systems are trying to help users implement use cases that hold such large state (like 2 months of data). I would mention here Samza state management <http://samza.apache.org/learn/documentation/0.9/container/state-management.html> and Trident state management <https://storm.apache.org/documentation/Trident-state>. I'm waiting for Spark to help with that too, because I generally prefer this technology :)
But considering holding state in Cassandra with Spark Streaming, I understand we're not talking here about using Cassandra as input or output (nor about using the spark-cassandra-connector <https://github.com/datastax/spark-cassandra-connector>). We're talking about querying Cassandra from map/mapPartitions functions. I have one question about it: Is it possible to query Cassandra asynchronously within Spark Streaming? And while doing so, is it possible to take the next batch of rows while the previous one is waiting on Cassandra I/O? I think (but I'm not sure) this generally asks whether several consecutive windows can interleave (because they take long to process). Let's draw it:

window1 <------|query Cassandra asynchronously--->
window2        <--------------------------------->

While writing this, I start to believe they can, because windows are time-triggered, not triggered when the previous window has finished... But it's better to ask :)

2015-04-15 2:08 GMT+02:00 Tathagata Das <t...@databricks.com>:

> Fundamentally, stream processing systems are designed for processing
> streams of data, not for storing large volumes of data for a long period
> of time. So if you have to maintain that much state for months, then it's
> best to use another system that is designed for long-term storage (like
> Cassandra), which has proper support for making all that state
> fault-tolerant, performant, etc. So yes, the best option is to use
> Cassandra for the state, with Spark Streaming jobs accessing the state
> from Cassandra. There are a number of optimizations that can be done. It's
> not too hard to build a simple on-demand populated cache (a singleton hash
> map, for example) that speeds up access from Cassandra, with all updates
> written through the cache. This is a common use of Spark Streaming +
> Cassandra/HBase.
>
> Regarding the performance of updateStateByKey, we are aware of the
> limitations, and we will improve it soon :)
>
> TD
>
> On Tue, Apr 14, 2015 at 12:34 PM, Krzysztof Zarzycki <k.zarzy...@gmail.com>
> wrote:
>
>> Hey guys, could you please help me with a question I asked on
>> Stackoverflow:
>> https://stackoverflow.com/questions/29635681/is-it-feasible-to-keep-millions-of-keys-in-state-of-spark-streaming-job-for-two
>> ? I'll be really grateful for your help!
>>
>> I'm also pasting the question below:
>>
>> I'm trying to solve a (simplified here) problem in Spark Streaming: Let's
>> say I have a log of events made by users, where each event is a tuple
>> (user name, activity, time), e.g.:
>>
>> ("user1", "view", "2015-04-14T21:04Z")
>> ("user1", "click", "2015-04-14T21:05Z")
>>
>> Now I would like to gather events by user to do some analysis of them.
>> Let's say that the output is some analysis of:
>>
>> ("user1", List(("view", "2015-04-14T21:04Z"), ("click", "2015-04-14T21:05Z")))
>>
>> The events should be kept for as long as *2 months*. During that time
>> there might be around *500 million* such events, and *millions of unique*
>> users, which are the keys here.
>>
>> *My questions are:*
>>
>> - Is it feasible to do such a thing with updateStateByKey on a DStream
>>   when I have millions of keys stored?
>> - Am I right that DStream.window is of no use here, when I have a
>>   2-month window and would like to have a slide of a few seconds?
>>
>> P.S. I found out that updateStateByKey is called on all the keys on
>> every slide, so that means it will be called millions of times every few
>> seconds. That makes me doubt this design, and I'm rather thinking about
>> alternative solutions like:
>>
>> - using Cassandra for state
>> - using Trident state (with Cassandra, probably)
>> - using Samza with its state management
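P.S. To make the asynchronous-querying idea concrete, here is a minimal sketch of the per-partition pattern I have in mind, in plain Java. `lookupStateAsync` is a hypothetical stand-in for a real async Cassandra driver call (the real driver also returns a future from its async execute); inside a `mapPartitions` function you would replace it accordingly. The point is that all lookups for a partition are issued before any result is awaited, so the I/O for consecutive rows overlaps:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

public class AsyncStateLookup {

    // Hypothetical stand-in for an async Cassandra query.
    static CompletableFuture<String> lookupStateAsync(String key) {
        return CompletableFuture.supplyAsync(() -> "state-for-" + key);
    }

    // Usable as the body of a mapPartitions function: fire every lookup
    // first, then join, so the partition's I/O requests run concurrently.
    static List<Map.Entry<String, String>> enrichPartition(Iterator<String> keys) {
        List<CompletableFuture<Map.Entry<String, String>>> pending = new ArrayList<>();
        keys.forEachRemaining(k ->
                pending.add(lookupStateAsync(k)
                        .thenApply(state -> new SimpleEntry<>(k, state))));
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (CompletableFuture<Map.Entry<String, String>> f : pending) {
            out.add(f.join()); // blocks only after all requests are in flight
        }
        return out;
    }
}
```

This only overlaps I/O for rows within one batch; whether whole windows interleave is the separate scheduling question above.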
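The "on-demand populated cache (singleton hash map)" with write-through updates that Tathagata describes could be sketched roughly like this. It's only a sketch: `KeyValueStore` is a hypothetical interface standing in for the Cassandra-backed store, and one such singleton would live per executor JVM:

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class StateCache {

    // Hypothetical stand-in for the Cassandra-backed store.
    public interface KeyValueStore {
        Optional<String> read(String key);
        void write(String key, String value);
    }

    private static volatile StateCache instance;
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    private final KeyValueStore store;

    private StateCache(KeyValueStore store) { this.store = store; }

    // One cache per JVM (i.e. per executor), created lazily.
    public static StateCache getInstance(KeyValueStore store) {
        if (instance == null) {
            synchronized (StateCache.class) {
                if (instance == null) instance = new StateCache(store);
            }
        }
        return instance;
    }

    // On-demand population: consult the cache first, fall back to the store.
    public Optional<String> get(String key) {
        String cached = cache.get(key);
        if (cached != null) return Optional.of(cached);
        Optional<String> fromStore = store.read(key);
        fromStore.ifPresent(v -> cache.put(key, v));
        return fromStore;
    }

    // Write-through: every update goes to the store and then the cache.
    public void put(String key, String value) {
        store.write(key, value);
        cache.put(key, value);
    }
}
```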
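The scaling concern in the quoted P.S. is exactly that updateStateByKey's update function runs over every key in the state on every batch, not only the keys that received new events. A small pure-Java simulation of those semantics (hypothetical names, not Spark's actual implementation) makes the cost visible:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.BiFunction;

public class UpdateStateByKeySim {

    // Simulates one batch of updateStateByKey: the update function is
    // invoked once per key in (existing state ∪ new batch), so with
    // millions of stored keys it runs millions of times per batch even
    // when the batch itself is tiny.
    static <V, S> Map<String, S> step(
            Map<String, S> state,
            Map<String, List<V>> batch,
            BiFunction<List<V>, Optional<S>, Optional<S>> update) {
        Set<String> keys = new HashSet<>(state.keySet());
        keys.addAll(batch.keySet());
        Map<String, S> next = new HashMap<>();
        for (String k : keys) {
            List<V> newValues = batch.getOrDefault(k, List.of());
            update.apply(newValues, Optional.ofNullable(state.get(k)))
                  .ifPresent(s -> next.put(k, s));
        }
        return next;
    }
}
```

With two users in state and new events for only one of them, the update function still fires for both, which is why per-key cost dominates once the key set reaches the millions.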