Hi All,

My problem is as explained below.

Environment: Spark 2.2.0 installed on CDH
Use-Case: Reading from Kafka, cleansing the data and ingesting it into a
non-updatable (insert-only) database.

Problem: My streaming batch duration is 1 minute and I am receiving about 3000
messages/min. I am observing a weird case where, in the map transformations,
some of the messages are being passed more than once to the downstream
transformations. Because of this I have been seeing duplicates in the
downstream insert-only database.
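For reference, the pipeline is shaped roughly like the sketch below. This is
not the actual job, just the structure: it assumes the direct Kafka 0.10
DStream API, and the topic name, cleanse() and writeToDb() are placeholders.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object Pipeline {
  def cleanse(value: String): String = value.trim          // placeholder cleansing step
  def writeToDb(records: Iterator[String]): Unit = ()      // placeholder insert-only sink

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-cleanse-ingest")
    val ssc = new StreamingContext(conf, Seconds(60))      // 1-minute batch duration

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "cleanse-ingest",
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream
      .map(record => cleanse(record.value()))               // the map stage where duplicates appear
      .foreachRDD(rdd => rdd.foreachPartition(writeToDb))   // ingest into the insert-only DB

    ssc.start()
    ssc.awaitTermination()
  }
}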

It would have made sense if the reprocessing happened for the entire task, in
which case I would have assumed the problem was caused by a task failure. But
in my case I don't see any task failures, and only one or two particular
messages in the task get reprocessed.

Every time I relaunch the Spark job to process the Kafka messages from the
starting offset, the exact same messages get duplicated, irrespective of the
number of relaunches.

I re-published the messages that were getting duplicated back to Kafka at a
different offset to see if I would observe the same problem, but this time
they were not duplicated.

Workaround for now:
I added a cache at the end, just before ingestion into the DB, which is
updated with each processed event and thus makes sure an event is not
ingested again.
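Roughly, the workaround looks like the sketch below (simplified: here it is a
per-executor in-memory set keyed on an event id; eventId() and insertIntoDb()
are placeholders for the real extraction and insert logic).

import java.util.concurrent.ConcurrentHashMap

object ProcessedCache {
  // per-JVM cache of event ids that have already been written to the DB
  private val seen = ConcurrentHashMap.newKeySet[String]()

  // returns true the first time an id is seen, false on every later attempt
  def markIfNew(id: String): Boolean = seen.add(id)
}

// ... inside foreachPartition, before the insert:
// records.filter(r => ProcessedCache.markIfNew(eventId(r))).foreach(insertIntoDb)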


My question here is, why am I seeing this weird behavior (only a particular
message in the entire batch getting reprocessed)? Is there some configuration
that would help me fix this problem, or is this a bug?

Any solution apart from maintaining a cache would be of great help.


