I want to do a hash-based comparison to find duplicate records. Each record I receive from the stream has a hashid and a recordid field in it.
1. I want to have all the historic records (hashid, recordid --> key, value) in an in-memory RDD.
2. When a new record arrives in the Spark DStream RDD, I want to compare its hashid against the historic RDD to detect duplicates.
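The two steps above can be sketched in plain Python, Spark aside: a dict stands in for the in-memory historic RDD, and `check_record` is a hypothetical helper (not from the original post) that does the per-record comparison.

```python
# Plain-Python sketch of the dedup logic. In Spark, `historic` would be the
# historic RDD (or updateStateByKey state); here it is just a dict.
historic = {}  # hashid -> recordid of the first record seen with that hash


def check_record(hashid, recordid):
    """Return the recordid this record duplicates, or None if it is new.

    New records are added to the historic state as a side effect.
    """
    if hashid in historic:
        return historic[hashid]
    historic[hashid] = recordid
    return None
```

For example, `check_record("Hash1", "recordid1")` returns `None` (new record), and a later `check_record("Hash1", "recordid5")` returns `"recordid1"`.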
Thank you @markcitizen. What I want to achieve is, for example:
My historic RDD has:
(Hash1, recordid1)
(Hash2, recordid2)
And in the new stream I have the following:
(Hash3, recordid3)
(Hash1, recordid5)
In the above scenario:
1) For recordid5, I should get "recordid5 is a duplicate of recordid1".
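The example above amounts to a join on hashid between the historic set and the new batch. A minimal sketch with the same data (plain Python lists standing in for the RDD and the stream batch; in PySpark this would be a `join` between the two pair RDDs):

```python
# Hypothetical data matching the example: (hashid, recordid) pairs.
historic = [("Hash1", "recordid1"), ("Hash2", "recordid2")]
new_batch = [("Hash3", "recordid3"), ("Hash1", "recordid5")]

historic_map = dict(historic)

# Keep only new records whose hashid already exists in the historic set;
# each result pairs the duplicate with the record it duplicates.
duplicates = [
    (new_id, historic_map[h])
    for h, new_id in new_batch
    if h in historic_map
]
# duplicates == [("recordid5", "recordid1")]
```

Hash3 has no historic match, so recordid3 is treated as new and would be added to the historic set instead.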
I am using a PySpark stateful stream (Spark 2.0) that receives JSON from a socket. I get the following error when I send more than one record; that is, if I send only one message I get a response, but if I send more than one message I get this error:
import hashlib
import json

def createmd5Hash(po):
    # parse the JSON payload; hashing a canonical serialization of it is
    # one plausible completion of the truncated original
    data = json.loads(po)
    return hashlib.md5(json.dumps(data, sort_keys=True).encode('utf-8')).hexdigest()
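The actual error text is not shown above, but one common failure mode that matches the symptom (one message works, several fail) is the socket delivering multiple JSON objects in a single payload: `json.loads()` raises as soon as anything follows the first object. A hedged sketch (`parse_records` is a hypothetical helper, not from the original code) that parses newline-delimited records one at a time:

```python
import json


def parse_records(payload):
    """Parse one or more newline-delimited JSON records from a socket payload.

    json.loads() on the whole payload fails when a second object follows
    the first; parsing line by line handles batched messages.
    """
    records = []
    for line in payload.splitlines():
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records
```

This assumes the sender terminates each JSON record with a newline; if records are concatenated without a delimiter, an incremental decoder (e.g. `json.JSONDecoder().raw_decode` in a loop) would be needed instead.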