I personally would try to avoid updateStateByKey for sessionization when
you have long sessions / a lot of keys, because its per-batch cost is
linear in the number of tracked keys.
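The linear cost comes from the fact that the state update function is invoked for every tracked key on every batch, even keys that received no new events. A minimal pure-Scala sketch of such a sessionization update function (the `Session` type, its field names, and the 30-minute timeout are illustrative assumptions; the Spark wiring itself is omitted):

```scala
// Sketch of an updateStateByKey-style update function for sessionization.
// updateStateByKey calls this for EVERY tracked key each batch, even keys
// with no new events in that batch -- hence the cost linear in key count.
case class Session(events: Int, lastSeenMs: Long)

val sessionTimeoutMs = 30 * 60 * 1000L  // assumed 30-minute idle timeout

def updateSession(nowMs: Long)(
    newEvents: Seq[Long],              // event timestamps in this batch
    state: Option[Session]): Option[Session] = {
  if (newEvents.nonEmpty) {
    // New activity: extend the session with the batch's events.
    val prev = state.getOrElse(Session(0, 0L))
    Some(Session(prev.events + newEvents.size, newEvents.max))
  } else if (state.exists(s => nowMs - s.lastSeenMs > sessionTimeoutMs)) {
    None                               // idle too long: returning None drops the key
  } else {
    state                              // unchanged, but still paid the per-key call
  }
}
```

In a streaming job this would be applied with something like `events.updateStateByKey(updateSession(System.currentTimeMillis()) _)`; note that even the "unchanged" branch runs once per key per batch, which is the cost the comment above refers to.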

On Tue, Jul 14, 2015 at 6:25 PM, Tathagata Das <t...@databricks.com> wrote:

> [Apologies for repost, for those who have seen this response already in
> the dev mailing list]
>
> 1. When you set ssc.checkpoint(checkpointDir), Spark Streaming
> periodically saves the state RDD (which is a snapshot of all the state
> data) to HDFS using RDD checkpointing. In fact, a streaming app that uses
> updateStateByKey will not start until you set a checkpoint directory.
>
> 2. The performance of updateStateByKey is largely independent of the
> source being used - receiver-based or direct Kafka. The absolute
> performance obviously depends on a LOT of variables: the size of the
> cluster, parallelization, etc. The key thing is that you must ensure
> sufficient parallelization at every stage - receiving, shuffles
> (updateStateByKey included), and output.
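Putting the two points together, a driver skeleton might look like the following. This is a sketch only, not runnable as-is: it assumes a Spark cluster, the spark-streaming artifacts on the classpath, a hypothetical checkpoint path, and an `events` DStream (e.g. from KafkaUtils.createDirectStream) that is not shown.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("sessionization")
val ssc = new StreamingContext(conf, Seconds(10))

// Required before a job using updateStateByKey will start (point 1).
ssc.checkpoint("hdfs:///tmp/sessionization-checkpoint")  // hypothetical path

// events: DStream[(String, Long)] of (sessionKey, eventTimestamp), creation omitted.
// The numPartitions overload controls shuffle parallelism for the state update (point 2):
// val sessions = events.updateStateByKey(updateFunc, numPartitions = 64)

ssc.start()
ssc.awaitTermination()
```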
>
> Some more discussion in my talk -
> https://www.youtube.com/watch?v=d5UJonrruHk
>
>
>
> On Tue, Jul 14, 2015 at 4:13 PM, swetha <swethakasire...@gmail.com> wrote:
>
>>
>> Hi,
>>
>> I have a question regarding sessionization using updateStateByKey. If
>> near-real-time state needs to be maintained in a Streaming application,
>> what happens when the number of RDDs maintaining the state becomes very
>> large? Does the state automatically get saved to HDFS and reloaded when
>> needed, or do I have to use code like ssc.checkpoint(checkpointDir)?
>> Also, what is the performance like if I use both DStream checkpointing to
>> maintain the state and the Kafka direct approach for exactly-once
>> semantics?
>>
>>
>> Thanks,
>> Swetha
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Sessionization-using-updateStateByKey-tp23838.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>