Hi all, my problem is as explained below.
Environment: Spark 2.2.0 installed on CDH.

Use case: reading from Kafka, cleansing the data, and ingesting it into a non-updatable (insert-only) database.

Problem: my streaming batch duration is 1 minute and I receive about 3,000 messages/min. I am observing a strange case where, in the map transformations, some messages are handed to the downstream transformations more than once. Because of this I am seeing duplicates in the downstream insert-only database. It would have made sense if the entire task were reprocessed; in that case I would have assumed the cause was a task failure. But I don't see any task failures, and only one or two particular messages in the task get reprocessed. (The job is roughly the shape of the pipeline sketch below.)

Every time I relaunch the Spark job to process the Kafka messages from the starting offset, it duplicates exactly the same messages, regardless of how many times I relaunch. I wrote the messages that were being duplicated back to Kafka at a different offset to see whether the same problem would occur, but at the new offset they were not duplicated.

Workaround for now: I added a cache just before ingestion into the DB that records every processed event, so an event that has already been processed is not inserted again (roughly the dedup sketch below).

My questions: why am I seeing this strange behaviour (only one or two particular messages in the entire batch being reprocessed)? Is there a configuration that would fix this, or is it a bug? Any solution other than maintaining a cache would be a great help.
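For reference, the job is roughly the following shape. This is a minimal sketch rather than my actual code; the broker address, topic, group id, and the cleanse/insertIntoDb helpers are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaToDb {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-db")
    val ssc = new StreamingContext(conf, Seconds(60))    // 1-minute batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",             // placeholder
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "cleansing-job",           // placeholder
      "auto.offset.reset"  -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("events"), kafkaParams))

    stream
      .map(record => cleanse(record.value))   // cleansing step where the duplicates show up
      .foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          partition.foreach(insertIntoDb)     // insert-only sink
        }
      }

    ssc.start()
    ssc.awaitTermination()
  }

  def cleanse(raw: String): String = raw.trim   // placeholder cleansing logic
  def insertIntoDb(row: String): Unit = ()      // placeholder DB insert
}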
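And this is roughly how the dedup-cache workaround looks; again a minimal sketch, where DedupSink, messageId, and insertIntoDb are placeholder names. The cache is an in-memory, per-executor set keyed on a stable id carried in the message:

import java.util.Collections
import java.util.concurrent.ConcurrentHashMap

object DedupSink {
  // Per-executor concurrent set of ids that have already been written.
  private val seen =
    Collections.newSetFromMap(new ConcurrentHashMap[String, java.lang.Boolean]())

  // Set.add returns true only the first time an id is inserted.
  private def firstTime(id: String): Boolean = seen.add(id)

  // Used inside foreachPartition in place of the plain insert: a row whose
  // id has already been written from this executor is skipped.
  def insertDeduped(partition: Iterator[String]): Unit =
    partition.foreach { row =>
      val id = messageId(row)   // placeholder: a stable id taken from the payload
      if (firstTime(id)) insertIntoDb(row)
    }

  private def messageId(row: String): String = row.take(36)  // placeholder id extraction
  private def insertIntoDb(row: String): Unit = ()           // placeholder DB insert
}

In the pipeline above, partition.foreach(insertIntoDb) then becomes DedupSink.insertDeduped(partition). It works, but it only guards within one executor's lifetime, which is why I would prefer a proper fix.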