Kafka can keep track of the offsets you've seen (in a separate internal topic, keyed by your consumer group), but that tracking is usually best effort, and you're probably better off also keeping track of your offsets yourself.
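A rough sketch of that "keep track of your own offsets" idea in Spark (Scala). The broker address, topic name and HDFS path are placeholders, not anything from your setup, so treat it as a starting point rather than a drop-in:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder.appName("kafka-batch-offsets").getOrCreate()
import spark.implicits._

// Batch read from Kafka. On the first run use "earliest"; on later runs
// pass the JSON built from the offsets persisted by the previous run.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "my-topic")                  // placeholder
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()

// Every row already carries its partition and offset, so the highest
// offset seen per partition is a simple aggregation.
val maxOffsets = df.groupBy($"partition").agg(max($"offset").as("maxOffset"))

// Persist somewhere durable for the next run to pick up.
maxOffsets.write.mode("overwrite").json("hdfs:///offsets/my-topic") // placeholder

// On the next run, build the startingOffsets JSON from the stored values.
// Note the + 1: the starting offset is inclusive, so resuming at the
// stored max would reprocess the last record of the previous run.
val startingOffsets = maxOffsets.as[(Int, Long)].collect()
  .map { case (p, o) => s""""$p":${o + 1}""" }
  .mkString("""{"my-topic":{""", ",", "}}")

The resulting string (something like {"my-topic":{"0":1001,"1":532}}) is what you'd pass as the startingOffsets option in place of "earliest" on the following run.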
If the producer resends a message, you will have to dedupe it, since you have most likely already seen it; how you handle that depends on your data. I believe offsets increment automatically, and you will generally not see the same offset occur more than once within a Kafka topic partition (feel free to correct me on this, though). So the most likely scenario you need to handle is the producer sending a duplicate message under two different offsets (see the dedupe sketch after the quoted message below). The alternative is to reprocess from the offset where you believe the message was last seen.

Kind regards,
Chris

On Mon, 3 Feb 2020, 7:39 pm Ruijing Li, <liruijin...@gmail.com> wrote:

> Hi all,
>
> My use case is to read from a single Kafka topic using a batch Spark SQL
> job (not structured streaming, ideally). I want this batch job, every time
> it starts, to get the last offset it stopped at, start reading from there
> until it has caught up to the latest offset, store the result, and stop
> the job. Given that the dataframe has a partition and offset column, my
> first thought for offset management is to groupBy partition and agg the
> max offset, then store it in HDFS. Next time the job runs, it will read
> and start from this max offset using startingOffsets.
>
> However, I was wondering if this will work. If the Kafka producer failed
> an offset and later decides to resend it, I will have skipped it, since
> I'm starting from the max offset sent. How does Spark Structured Streaming
> know to continue onwards - does it keep a state of all offsets seen? If
> so, how can I replicate this for batch without missing data? Any help
> would be appreciated.
>
> --
> Cheers,
> Ruijing Li
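And the dedupe sketch mentioned above, reusing df from the earlier sketch. The eventId field is a hypothetical unique key inside a JSON value payload; substitute whatever uniquely identifies a message in your data:

import org.apache.spark.sql.functions.{col, get_json_object}

// Pull a business key out of the message payload and keep one row per key,
// so a message the producer resent under a new offset collapses to one row.
val deduped = df
  .withColumn("eventId", get_json_object(col("value").cast("string"), "$.eventId"))
  .dropDuplicates("eventId")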