Hi Friends,
I am trying to solve a use case in spark streaming, I need help on getting to
right approach on lookup / update the master data.
Use case ( simplified )
I've a dataset of entity with three attributes and identifier/row key in a
persistent store.
Each attribute along with row key come from a different stream let's say,
effectively 3 source streams.
Now whenever any attribute comes up, I want to update/sync the persistent store
and do some processing, but the processing would require the latest state of
entity with latest values of three attributes.
I wish if I have the all the entities cached in some sort of centralized cache
( like we have data in hdfs ) within spark streaming which may be used for data
local processing. But I assume there is no such thing.
potential approaches I m thinking of, I suspect first two are not feasible, but
I want to confirm,
1. Is Broadcast Variables mutable ? If yes, can I use it as cache for
all entities sizing around 100s of GBs provided i have a cluster with enough
RAM.
1. Is there any kind of sticky partition possible, so that I route my stream
data to go through the same node where I've the corresponding entities, subset
of entire store, cached in memory within JVM / off heap on the node, this would
avoid lookups from store.
2. If I stream the entities from persistent store into engine, this becomes
4th stream - the entity stream, how do i use join / merge to enable stream
1,2,3 to lookup and update the data from stream 4. Would DStream.join work for
few seconds worth of data in attribute streams with all data in entity stream ?
Or do I use transform and within that use rdd join, I've doubts if I am leaning
towards core spark approach in spark streaming ?
1. The last approach, which i think will surely work but i want to avoid, is
i keep the entities in IMDB and do lookup/update calls on from stream 1,2 and 3.
Any help is deeply appreciated as this would help me design my system
efficiently and the solution approach may become a beacon for lookup master
data sort of problems.
Regards,
Amit
________________________________
NOTE: This message may contain information that is confidential, proprietary,
privileged or otherwise protected by law. The message is intended solely for
the named addressee. If received in error, please destroy and notify the
sender. Any use of this email is prohibited when received in error. Impetus
does not represent, warrant and/or guarantee, that the integrity of this
communication has been maintained nor that the communication is free of errors,
virus, interception or interference.