Hello,

I have a use case where I have to stream events from Kafka to a JDBC sink.
Kafka producers write events in bursts of hourly batches.

I started with a Structured Streaming approach, but it turns out that
Structured Streaming has no JDBC sink. I found an implementation in Apache
Bahir, but it's buggy and looks abandoned.

So I reimplemented the job using DStreams, and everything works fine except
that the executors stop consuming once they've reached the latest offsets.
All events produced afterwards are discarded. The last INFO-level messages
are lines like:

20/11/10 16:19:02 INFO KafkaRDD: Beginning offset 7908480 is the same as
ending offset skipping dev_applogs 10

Here, dev_applogs is the topic being consumed and 10 is the partition number.

I played with different values of "auto.offset.reset" and
"enable.auto.commit", but they all lead to the same behaviour. The settings
I actually need for my use case are:

auto.offset.reset=latest
enable.auto.commit=true

I use Spark 2.4.0 and Kafka 2.2.1.
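
In case it helps, the stream is created roughly like this (simplified
sketch; the bootstrap servers, group id and batch interval are placeholders,
and the actual JDBC write inside foreachRDD is omitted):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("kafka-to-jdbc")
val ssc  = new StreamingContext(conf, Seconds(60))  // placeholder batch interval

// Kafka consumer settings, including the two options mentioned above
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "kafka:9092",             // placeholder
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "applogs-jdbc-sink",      // placeholder
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (true: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("dev_applogs"), kafkaParams)
)

stream.foreachRDD { rdd =>
  // write the micro-batch to the JDBC sink here (omitted)
}

ssc.start()
ssc.awaitTermination()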

Is this the expected behaviour? Shouldn't the Spark executors poll the
Kafka partitions continuously for new offsets? That is the behaviour I see
with DataStreamReader, and it's what I expected to find with DStreams as
well.
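
For comparison, the Structured Streaming reader I started with looks
roughly like this (sketch; the broker address is a placeholder), and it
does keep picking up new offsets as they arrive:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-to-jdbc").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")  // placeholder
  .option("subscribe", "dev_applogs")
  .option("startingOffsets", "latest")
  .load()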

Thanks,
R.
