spark stream based deduplication

2016-09-21 Thread backtrack5
I want to do a hash-based comparison to find duplicate records. Each record I receive from the stream will have hashid and recordid fields in it. 1. I want to keep all the historic records (hashid, recordid as key, value) in an in-memory RDD. 2. When a new record is received in the Spark DStream RDD, I want to compare its hash against the historic RDD.
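
A minimal sketch of this approach in PySpark (the host, port, and the comma-separated "hashid,recordid" line format are assumptions, not the poster's code): keep the historic pairs as a cached pair RDD and left-outer-join each micro-batch against it via DStream.transform.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="dedup-sketch")
    ssc = StreamingContext(sc, 10)

    # Historic records as a (hashid, recordid) pair RDD, cached in memory.
    historic = sc.parallelize([("hash1", "recordid1"),
                               ("hash2", "recordid2")]).cache()

    # Assume each incoming line is "hashid,recordid".
    lines = ssc.socketTextStream("localhost", 9999)
    incoming = lines.map(lambda line: tuple(line.split(",", 1)))

    def flag_duplicates(batch_rdd):
        # leftOuterJoin yields (hashid, (new_recordid, old_recordid or None)).
        joined = batch_rdd.leftOuterJoin(historic)
        def label(kv):
            hashid, (new_id, old_id) = kv
            if old_id is not None:
                return (new_id, "duplicate of %s" % old_id)
            return (new_id, "new")
        return joined.map(label)

    incoming.transform(flag_duplicates).pprint()
    ssc.start()
    ssc.awaitTermination()

Note that transform runs once per micro-batch, so each batch is joined against whatever `historic` held at that moment; if the history must grow as new hashes arrive, stateful streaming (see the third thread below) is the more robust route.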

Re: spark stream based deduplication

2016-09-25 Thread backtrack5
Thank you @markcitizen. What I want to achieve is, for example: my historic RDD has (Hash1, recordid1) and (Hash2, recordid2), and in the new stream I have (Hash3, recordid3) and (Hash1, recordid5). In this scenario, 1) for recordid5 I should get that recordid5 is a duplicate of recordid1
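
Worked through with plain RDD operations (a sketch using the values from the post, not the poster's actual code), a left outer join makes both cases visible in one pass:

    historic = sc.parallelize([("Hash1", "recordid1"), ("Hash2", "recordid2")])
    batch    = sc.parallelize([("Hash3", "recordid3"), ("Hash1", "recordid5")])

    joined = batch.leftOuterJoin(historic)
    # ("Hash3", ("recordid3", None))        -> new record
    # ("Hash1", ("recordid5", "recordid1")) -> recordid5 duplicates recordid1

    duplicates = joined.filter(lambda kv: kv[1][1] is not None)
    fresh      = joined.filter(lambda kv: kv[1][1] is None) \
                       .map(lambda kv: (kv[0], kv[1][0]))

    # Fold unseen hashes back into the history for the next batch.
    historic = historic.union(fresh)

Unioning every batch grows the RDD lineage without bound; periodic checkpointing, or state managed by updateStateByKey, avoids that.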

spark stateful streaming error

2016-10-06 Thread backtrack5
I am using a PySpark stateful stream (2.0) that receives JSON from a socket. I am getting the following error when I send more than one record: if I send only one message I get a response, but if I send more than one message I get the error below. def createmd5Hash(po): data = json.
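
The post's code is cut off, so the following is only a hedged reconstruction of what a stateful dedup pipeline of this shape might look like (the checkpoint path, host, port, and the body of createmd5Hash are all assumptions). One plausible cause of the multi-record failure is framing: socketTextStream splits records on newlines, so two JSON objects sent on one line (or without a trailing newline) arrive as a single string and json.loads raises an error.

    import hashlib
    import json

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="stateful-dedup-sketch")
    ssc = StreamingContext(sc, 10)
    ssc.checkpoint("/tmp/dedup-checkpoint")  # required by updateStateByKey

    def createmd5Hash(po):
        # Assumed body: parse one JSON object per line and hash a
        # canonical rendering of it. json.loads fails if `po` holds
        # more than one JSON object, matching the reported symptom.
        data = json.loads(po)
        canonical = json.dumps(data, sort_keys=True).encode("utf-8")
        return hashlib.md5(canonical).hexdigest()

    lines = ssc.socketTextStream("localhost", 9999)
    pairs = lines.map(lambda po: (createmd5Hash(po), po))

    def update_seen(new_values, state):
        # Keep the first record ever seen for each hash; later records
        # with the same hash can then be identified as duplicates.
        return state if state is not None else new_values[0]

    pairs.updateStateByKey(update_seen).pprint()
    ssc.start()
    ssc.awaitTermination()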