I want to do a hash-based comparison to find duplicate records. Each record I
receive from the stream has hashid and recordid fields in it.

1. I want to keep all the historic records (hashid, recordid --> key, value)
in an in-memory RDD
2. When a new record arrives in the Spark DStream RDD, I want to compare it
against the historic records (hashid, recordid)
3. I also want to add the new records to the existing historic records
(hashid, recordid --> key, value) in the in-memory RDD
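A minimal non-Spark sketch of the three steps above (plain Python, with a dict standing in for the in-memory historic lookup; the hashid/recordid field names come from the post, everything else is illustrative):

```python
# Plain-Python sketch of the dedup logic: a dict stands in for the
# historic (hashid -> recordid) lookup that would live in a cached RDD.
historic = {"h1": "r1", "h2": "r2"}

def process_batch(records, historic):
    """records: list of (hashid, recordid) tuples from one stream batch.
    Returns the duplicates found; new records are added to historic."""
    duplicates = []
    for hashid, recordid in records:
        if hashid in historic:                 # step 2: compare against history
            duplicates.append((hashid, recordid, historic[hashid]))
        else:                                  # step 3: grow the historic lookup
            historic[hashid] = recordid
    return duplicates

dups = process_batch([("h2", "r9"), ("h3", "r3")], historic)
# "h2" collides with a historic record; "h3" is new and gets added.
```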

My thoughts:

1. join the time-based RDDs and cache them in memory (historic lookup)
2. when a new RDD arrives, compare each record against the historic lookup

What I have done:

1. I have created a streaming pipeline and am able to consume the records.
2. But I am not sure how to store them in memory.
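For keeping per-key state across batches, Spark Streaming's `updateStateByKey` is one option: you supply an update function that merges each batch's values for a key into the running state, and Spark maintains the state RDD for you. A sketch of such a function, keyed by hashid (plain Python so it can be read on its own; the name `keep_first` is illustrative, and in a real job it would be passed to `DStream.updateStateByKey`):

```python
def keep_first(new_recordids, existing_recordid):
    """updateStateByKey-style update function, keyed by hashid.

    new_recordids: recordids seen for this hashid in the current batch.
    existing_recordid: recordid already stored for this hashid, or None.
    Keeps the first recordid ever seen, so later arrivals with the same
    hashid can be flagged as duplicates.
    """
    if existing_recordid is not None:
        return existing_recordid           # hashid already known: keep state
    return new_recordids[0] if new_recordids else None
```

Spark calls this once per key per batch; the resulting state DStream is effectively the in-memory historic lookup, and checkpointing must be enabled for it.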

I have the following questions:

1. How can I achieve this, or is there a workaround?
2. Can I do this using MLlib, or does Spark Streaming fit my use case?

Sent from the Apache Spark User List mailing list archive at Nabble.com.
