Hi, I have a requirement to process a large volume of events while ignoring duplicates at the same time.

Events are consumed from Kafka, and each event carries an eventId. It can happen that an event has already been processed and then arrives again at a different offset. My questions:

1. Can I use a Spark RDD to persist the processed events and then look new events up against it? I have a JavaPairRDD<eventId, timestamp>. How do I do a lookup inside an RDD while processing new events, so that if an event is already present in the persisted RDD I skip it, and otherwise process it (see the sketch below)? Will rdd.lookup(key) be efficient on billions of events?
2. How do I update the RDD, given that RDDs are immutable?
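To make the question concrete, here is a minimal sketch of what I have in mind. All names and the local sample data are placeholders (the real batch comes from Kafka), and the union at the end is only my guess at how the "update" could work:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;

public class DedupSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "dedup-sketch");

        // Persisted state of already-processed events: eventId -> timestamp
        JavaPairRDD<String, Long> processed = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("evt-1", 1000L),
                new Tuple2<>("evt-2", 2000L))).cache();

        // A new batch of events (stand-in for what is consumed from Kafka)
        List<Tuple2<String, Long>> newBatch = Arrays.asList(
                new Tuple2<>("evt-2", 3000L),
                new Tuple2<>("evt-3", 4000L));

        for (Tuple2<String, Long> event : newBatch) {
            // lookup(...) is a driver-side action; without a partitioner
            // it scans every partition, which is my efficiency concern
            List<Long> seen = processed.lookup(event._1());
            if (seen.isEmpty()) {
                System.out.println("processing " + event._1());
                // RDDs are immutable, so "updating" means building a NEW RDD;
                // is union-ing one entry at a time really the right way?
                processed = processed.union(sc.parallelizePairs(Arrays.asList(event)));
            } else {
                System.out.println("skipping duplicate " + event._1());
            }
        }

        sc.stop();
    }
}
```

This works on small test data, but I worry that the per-event lookup and the ever-growing union lineage will not scale to billions of events.

Thanks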