I have a Spark Streaming app that reads from Kafka using a receiver-less direct connection (KafkaUtils.createDirectStream) with a 1-minute batch interval. For about 15 hours it ran fine, with batch input sizes ranging from 3,861,758 down to 16,836 events.
Then, about 3 hours ago, every one-minute batch started bringing in exactly the same number of records: 5,760 (across 2 topics; topic 1 has 64 partitions, topic 2 has 32). I know more data than those 5,760 records is being piped in, and eventually we'll fall so far behind that our Kafka offsets will no longer be available. It seems telling that 5,760 / 96 partitions = 60, which is exactly my batch interval in seconds, i.e. 1 record per partition per second.

I have spark.streaming.backpressure.enabled = true, and even though the current documentation states it isn't used, I also have a value set for spark.streaming.kafka.maxRatePerPartition.

Has anyone else seen this issue where the rate seems capped even though the stream should be pulling more data?

Thanks,
Robert
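For reference, here is a sketch of the relevant settings in spark-defaults.conf form. The numeric value for maxRatePerPartition below is illustrative only, not my actual setting:

```properties
# Illustrative config sketch, not the exact production values.

# Backpressure is on; the rate controller picks the per-batch ingest rate.
spark.streaming.backpressure.enabled           true

# Documented as not applying to the direct stream with backpressure in some
# versions, but set anyway as an upper bound (records/sec per partition):
spark.streaming.kafka.maxRatePerPartition      10000
```

The batch interval itself is set in code when the StreamingContext is created, e.g. Minutes(1).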