An in-memory hash-keyed data structure of some kind, so that you're close to
linear in the number of items in a batch, not the number of outstanding
keys.  That's more complex, because you have to deal with expiration for
keys that never get hit again, and for unusually long sessions you have to
either drop them or hit durable storage.

Maybe someone has a better idea, I'd like to hear it.

On Wed, Jul 15, 2015 at 8:54 AM, algermissen1971 <algermissen1...@icloud.com>
wrote:

> Hi Cody,
>
> oh ... I thought that was one of *the* use cases for it. Do you have a
> suggestion / best practice how to achieve the same thing with better
> scaling characteristics?
>
> Jan
>
> On 15 Jul 2015, at 15:33, Cody Koeninger <c...@koeninger.org> wrote:
>
> > I personally would try to avoid updateStateByKey for sessionization when
> > you have long sessions / a lot of keys, because it's linear in the number
> > of keys.
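For context, a sketch of the kind of update function updateStateByKey takes (names here are hypothetical; the `(Seq[V], Option[S]) => Option[S]` shape is the real API). Spark Streaming calls it once per tracked key every batch, with an empty `events` for keys that got no new data, which is exactly why each batch costs time linear in the total number of keys rather than the batch size:

```scala
// Update function in the shape updateStateByKey expects:
// (Seq[V], Option[S]) => Option[S]. It runs once per tracked key per
// batch; `events` is empty for keys with no new data in this batch,
// yet the function is still invoked for them.
def updateSession(events: Seq[Long], state: Option[Long]): Option[Long] =
  Some(state.getOrElse(0L) + events.size)

// In a streaming job this would be wired up roughly as:
//   val sessionCounts = keyedEvents.updateStateByKey(updateSession)
```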
> >
> > On Tue, Jul 14, 2015 at 6:25 PM, Tathagata Das <t...@databricks.com>
> wrote:
> > [Apologies for repost, for those who have seen this response already in
> the dev mailing list]
> >
> > 1. When you set ssc.checkpoint(checkpointDir), Spark Streaming
> > periodically saves the state RDD (which is a snapshot of all the state
> > data) to HDFS using RDD checkpointing. In fact, a streaming app with
> > updateStateByKey will not start until you set a checkpoint directory.
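A minimal setup fragment for point 1 (the app name and checkpoint path are placeholders; requires the spark-streaming artifact on the classpath):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("sessionization")
val ssc = new StreamingContext(conf, Seconds(10))

// Required when using updateStateByKey: the state RDD snapshot is
// periodically written here via RDD checkpointing.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")
```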
> >
> > 2. The updateStateByKey performance is largely independent of the
> > source being used - receiver-based or direct Kafka. The absolute
> > performance obviously depends on a LOT of variables: size of the
> > cluster, parallelization, etc. The key thing is that you must ensure
> > sufficient parallelization at every stage - receiving, shuffles
> > (updateStateByKey included), and output.
> >
> > Some more discussion in my talk -
> https://www.youtube.com/watch?v=d5UJonrruHk
> >
> >
> >
> > On Tue, Jul 14, 2015 at 4:13 PM, swetha <swethakasire...@gmail.com>
> wrote:
> >
> > Hi,
> >
> > I have a question regarding sessionization using updateStateByKey. If
> near
> > real time state needs to be maintained in a Streaming application, what
> > happens when the number of RDDs to maintain the state becomes very large?
> > Does it automatically get saved to HDFS and reload when needed or do I
> have
> > to use any code like ssc.checkpoint(checkpointDir)?  Also, how is the
> > performance if I use both DStream Checkpointing for maintaining the state
> > and use Kafka Direct approach for exactly once semantics?
> >
> >
> > Thanks,
> > Swetha
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Sessionization-using-updateStateByKey-tp23838.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> >
> >
>
>