Hi, I have a requirement to process a large volume of events while ignoring duplicates at the same time.

Events are consumed from Kafka, and each event carries an eventId. It can happen that an event has already been processed and then arrives again at a different offset. My questions:

1. Can I use a Spark RDD to persist the processed events and then look new events up against it? I have a JavaPairRDD<eventId, timestamp>. How do I do a lookup inside an RDD while processing new events, so that if an event is already present in the persisted RDD I skip it, and otherwise process it (see the sketch below)? Will rdd.lookup(key) be efficient on billions of events?
2. How do I update the RDD, given that RDDs are immutable?
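To make the question concrete, here is a minimal sketch of what I have in mind. All names and the local sample data are placeholders (the real batch comes from Kafka), and the union at the end is only my guess at how the "update" could work:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;

public class DedupSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "dedup-sketch");

        // Persisted state of already-processed events: eventId -> timestamp
        JavaPairRDD<String, Long> processed = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("evt-1", 1000L),
                new Tuple2<>("evt-2", 2000L))).cache();

        // A new batch of events (stand-in for what is consumed from Kafka)
        List<Tuple2<String, Long>> newBatch = Arrays.asList(
                new Tuple2<>("evt-2", 3000L),
                new Tuple2<>("evt-3", 4000L));

        for (Tuple2<String, Long> event : newBatch) {
            // lookup(...) is a driver-side action; without a partitioner
            // it scans every partition, which is my efficiency concern
            List<Long> seen = processed.lookup(event._1());
            if (seen.isEmpty()) {
                System.out.println("processing " + event._1());
                // RDDs are immutable, so "updating" means building a NEW RDD;
                // is union-ing one entry at a time really the right way?
                processed = processed.union(sc.parallelizePairs(Arrays.asList(event)));
            } else {
                System.out.println("skipping duplicate " + event._1());
            }
        }

        sc.stop();
    }
}
```

This works on small test data, but I worry that the per-event lookup and the ever-growing union lineage will not scale to billions of events.

Thanks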