Responses inline. Hope they help.
On Thu, Apr 9, 2015 at 8:20 AM, Amit Assudani <[email protected]> wrote:
> Hi Friends,
>
> I am trying to solve a use case in Spark Streaming, and I need help
> finding the right approach for looking up / updating the master data.
>
> Use case ( simplified )
> I have a dataset of entities, each with three attributes and an
> identifier/row key, in a persistent store.
>
> Each attribute, along with the row key, comes from a different stream;
> effectively there are 3 source streams.
>
> Now whenever any attribute arrives, I want to update/sync the
> persistent store and do some processing, but the processing requires the
> latest state of the entity, with the latest values of all three attributes.
>
> I wish I could have all the entities cached in some sort of centralized
> cache (like we have data in HDFS) within Spark Streaming, which could be
> used for data-local processing. But I assume there is no such thing.
>
> Potential approaches I am thinking of (I suspect the first two are not
> feasible, but I want to confirm):
> 1. Are broadcast variables mutable? If yes, can I use one as a cache
> for all entities, sized at 100s of GBs, provided I have a cluster with
> enough RAM?
>
Broadcast variables are not mutable. But you can always create a new
broadcast variable when you want and use the "latest" broadcast variable in
your computation.
dstream.transform { rdd =>
  // fetch the existing broadcast variable, or update + create a new one and return it
  val latestBroadcast = getLatestBroadcastVariable()
  // use latestBroadcast.value in the RDD transformations
  val transformedRDD = rdd.map { record => /* ... use latestBroadcast.value ... */ record }
  transformedRDD
}
Since the transform RDD-to-RDD function runs on the driver every batch
interval, it will always use the latest broadcast variable that you want.
Though note that whenever you create a new broadcast, the next batch may
take a little longer, as the data needs to actually be broadcast out.
That can also be made asynchronous by running a simple task (to force the
broadcasting out) on any new broadcast variable in a thread separate from the
Spark Streaming batch scheduler, but using the same underlying SparkContext.
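A rough sketch of what I mean, assuming a made-up loadLatestEntities() helper
that reads the latest master data; the dummy foreach job only exists to make
every executor fetch the broadcast blocks before the next batch touches them:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// create the new broadcast variable on the driver
val newBroadcast = sparkContext.broadcast(loadLatestEntities())  // loadLatestEntities() is hypothetical

// in a separate thread, run a trivial job on the same SparkContext that
// touches newBroadcast.value, so the data is shipped to executors ahead of time
Future {
  sparkContext.parallelize(1 to sparkContext.defaultParallelism, sparkContext.defaultParallelism)
    .foreach { _ => newBroadcast.value; () }
}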
>
> 2. Is there any kind of sticky partitioning possible, so that I can route my
> stream data through the same node where I have the corresponding
> entities (a subset of the entire store) cached in memory within the JVM / off
> heap on that node? This would avoid lookups from the store.
>
You could use updateStateByKey. That is quite sticky, but does not
eliminate the possibility that it can run on a different node. In fact this
is necessary for fault tolerance - what if the node it was supposed to run
on goes down? The task will be run on a different node, and you have to
design your application such that it can handle that.
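For what it's worth, a minimal sketch of what the update function could look
like here, assuming a made-up Entity case class for the three attributes and a
keyedUpdates DStream of (rowKey, Entity) built by unioning streams 1, 2 and 3
(each record carrying only the attribute it brings). updateStateByKey also
requires a checkpoint directory to be set on the StreamingContext:

case class Entity(attr1: Option[String], attr2: Option[String], attr3: Option[String])

// merge whatever attributes arrived in this batch into the running state for the key
def updateEntity(newValues: Seq[Entity], state: Option[Entity]): Option[Entity] = {
  val merged = newValues.foldLeft(state.getOrElse(Entity(None, None, None))) { (acc, u) =>
    Entity(u.attr1.orElse(acc.attr1), u.attr2.orElse(acc.attr2), u.attr3.orElse(acc.attr3))
  }
  Some(merged)
}

val entityState = keyedUpdates.updateStateByKey(updateEntity _)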
> 3. If I stream the entities from the persistent store into the engine, this
> becomes a 4th stream - the entity stream. How do I use join / merge to enable
> streams 1, 2, 3 to look up and update the data from stream 4? Would
> DStream.join work for a few seconds' worth of data in the attribute streams
> against all the data in the entity stream? Or do I use transform and, within
> that, an RDD join? I have doubts whether I am leaning towards a core Spark
> approach within Spark Streaming.
>
>
Depends on what kind of join! If you want to join every batch in a stream
with a static dataset (or a rarely updated dataset), then transform+join is
the way to go. If you want to join one stream with a window of data from
another stream, then DStream.join is the way to go.
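For illustration, a rough sketch of both variants, assuming a hypothetical
entityRDD (the master data loaded as an RDD keyed by row key), an
attributeStream of (rowKey, attributeValue) pairs, and an entityStream keyed
the same way:

// join each batch of the attribute stream against the (rarely changing) master data RDD
val enriched = attributeStream.transform { rdd =>
  rdd.join(entityRDD)   // RDD[(rowKey, (attributeValue, entityRecord))]
}

// whereas joining one stream with a window of data from another stream would be, e.g.:
val joined = attributeStream.join(entityStream.window(Minutes(5)))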
>
>
>
> 4. The last approach, which I think will surely work but want to
> avoid, is to keep the entities in an IMDB (in-memory database) and do
> lookup/update calls from streams 1, 2 and 3.
>
>
> Any help is deeply appreciated, as it would help me design my system
> efficiently, and the solution approach may become a beacon for master-data
> lookup sorts of problems.
>
> Regards,
> Amit
>