On 15 Jul 2015, at 17:38, Cody Koeninger <c...@koeninger.org> wrote:

> An in-memory keyed hash data structure of some kind, so that you're close to 
> linear in the number of items in a batch, not the number of outstanding keys. 
> That's more complex, because you have to deal with expiration for keys that 
> never get hit, and for unusually long sessions you have to either drop them 
> or flush them to durable storage.
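> 
> Roughly, as an untested sketch (SessionStore, flushSession and the timeout 
> are made-up names; stream is assumed to be a DStream of (key, event) string 
> pairs, and events for a given key are assumed to consistently land on the 
> same executor, e.g. by partitioning the Kafka topic by key):
> 
>   import java.util.concurrent.ConcurrentHashMap
> 
>   // one store per executor JVM: key -> (accumulated events, last-seen millis)
>   object SessionStore {
>     val sessions = new ConcurrentHashMap[String, (List[String], Long)]()
>     val timeoutMs = 30 * 60 * 1000L
>   }
> 
>   stream.foreachRDD { rdd =>
>     rdd.foreachPartition { events =>
>       val now = System.currentTimeMillis
>       // updating is linear in the batch, not in the outstanding keys
>       events.foreach { case (key, event) =>
>         val (acc, _) = Option(SessionStore.sessions.get(key)).getOrElse((Nil, now))
>         SessionStore.sessions.put(key, (event :: acc, now))
>       }
>       // the expiration sweep still scans stored keys; it can be amortized
>       // by running it only every N batches
>       val it = SessionStore.sessions.entrySet.iterator
>       while (it.hasNext) {
>         val e = it.next()
>         if (now - e.getValue._2 > SessionStore.timeoutMs) {
>           flushSession(e.getKey, e.getValue._1) // your durable-storage sink
>           it.remove()
>         }
>       }
>     }
>   }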

Thanks, yes. I already do the expiration check to terminate 'active' sessions 
and flush them to durable storage afterwards.

Excuse my newbie question: when doing this with my own data structure (e.g. such 
a hash), where should I execute the code that periodically checks the hash? 
Right now I am doing that in updateStateByKey - should I rather use foreachRDD?
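
For reference, a stripped-down version of what I do now (Session, flush and 
timeoutMs are simplified placeholders for my actual code):

  def updateSession(events: Seq[Event], state: Option[Session]): Option[Session] = {
    val now = System.currentTimeMillis
    val s = state.getOrElse(Session.empty).add(events, now)
    if (events.isEmpty && now - s.lastSeen > timeoutMs) {
      flush(s) // write the finished session to durable storage
      None     // returning None removes the key from the state
    } else {
      Some(s)
    }
  }

  val sessions = stream.updateStateByKey(updateSession _)

I am also not sure the flush side effect is safe inside the update function, 
since the state RDD can be recomputed - part of why I am asking about foreachRDD.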

And: if I understand you correctly, you are saying that updateStateByKey is 
more suitable for updating 'entities' of which only a limited number exist 
(e.g. the users behind the visits, or the products sold). Yes?

Jan


> 
> Maybe someone has a better idea, I'd like to hear it.
> 
> On Wed, Jul 15, 2015 at 8:54 AM, algermissen1971 <algermissen1...@icloud.com> 
> wrote:
> Hi Cody,
> 
> oh ... I thought that was one of *the* use cases for it. Do you have a 
> suggestion / best practice for how to achieve the same thing with better 
> scaling characteristics?
> 
> Jan
> 
> On 15 Jul 2015, at 15:33, Cody Koeninger <c...@koeninger.org> wrote:
> 
>> I personally would try to avoid updateStateByKey for sessionization when you 
>> have long sessions / a lot of keys, because each batch does work linear in 
>> the number of keys held in state, not just the keys seen in that batch.
>> 
>> On Tue, Jul 14, 2015 at 6:25 PM, Tathagata Das <t...@databricks.com> wrote:
>> [Apologies for repost, for those who have seen this response already in the 
>> dev mailing list]
>> 
>> 1. When you set ssc.checkpoint(checkpointDir), Spark Streaming periodically 
>> saves the state RDD (which is a snapshot of all the state data) to HDFS using 
>> RDD checkpointing. In fact, a streaming app that uses updateStateByKey will 
>> not start until you set a checkpoint directory.
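>> 
>> For example (conf, events and updateFunc are assumed to exist already; the 
>> checkpoint path and batch interval are just illustrative):
>> 
>>   import org.apache.spark.streaming.{Seconds, StreamingContext}
>> 
>>   val ssc = new StreamingContext(conf, Seconds(10))
>>   ssc.checkpoint("hdfs:///checkpoints/my-app") // app will not start without this
>>   val sessions = events.updateStateByKey(updateFunc _)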
>> 
>> 2. The updateStateByKey performance is largely independent of the source 
>> being used - receiver-based or direct Kafka. The absolute performance 
>> obviously depends on a LOT of variables: size of the cluster, degree of 
>> parallelism, etc. The key thing is that you must ensure sufficient 
>> parallelism at every stage - receiving, shuffles (updateStateByKey included), 
>> and output.
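>> 
>> For instance, updateStateByKey has an overload that takes an explicit 
>> partition count, so the state shuffle does not just inherit the default 
>> parallelism (64 below is an arbitrary example):
>> 
>>   // size the state/shuffle stage explicitly instead of relying on defaults;
>>   // the receiving side can similarly be spread with stream.repartition(n)
>>   val sessions = events.updateStateByKey(updateFunc _, 64) // 64 state partitions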
>> 
>> Some more discussion in my talk - https://www.youtube.com/watch?v=d5UJonrruHk
>> 
>> 
>> 
>> On Tue, Jul 14, 2015 at 4:13 PM, swetha <swethakasire...@gmail.com> wrote:
>> 
>> Hi,
>> 
I have a question regarding sessionization using updateStateByKey. If near
real-time state needs to be maintained in a Streaming application, what
happens when the number of RDDs maintaining the state becomes very large?
Does the state automatically get saved to HDFS and reloaded when needed, or do
I have to call something like ssc.checkpoint(checkpointDir)?  Also, how is the
performance if I use both DStream checkpointing for maintaining the state
and the Kafka direct approach for exactly-once semantics?
>> 
>> 
>> Thanks,
>> Swetha
>> 
>> 
>> 

