I have a Spark Streaming app that reads from Kafka using a receiver-less direct connection (KafkaUtils.createDirectStream) with a 1-minute batch interval. For about 15 hours it ran fine, with batch input sizes ranging from 3,861,758 down to 16,836 events.
Then, about 3 hours ago, every one-minute batch started bringing in exactly the same number of records: 5,760 (across 2 topics; topic 1 has 64 partitions, topic 2 has 32). I know more data than those 5,760 records is being piped in, and eventually we'll fall so far behind that our Kafka offsets will no longer be available. It seems telling that 5,760 / 96 partitions = 60, which is exactly my batch interval in seconds, i.e. 1 record per partition per second.

I have spark.streaming.backpressure.enabled = true, and even though the current documentation states it isn't used, I also have a value set for spark.streaming.kafka.maxRatePerPartition.

Has anyone else seen this issue where the rate seems capped even though the stream should be pulling more data?

Thanks,
Robert
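For reference, here is a sketch of the relevant settings in spark-defaults.conf form. The numeric value for maxRatePerPartition below is illustrative only, not my actual setting:

```properties
# Illustrative config sketch, not the exact production values.

# Backpressure is on; the rate controller picks the per-batch ingest rate.
spark.streaming.backpressure.enabled           true

# Documented as not applying to the direct stream with backpressure in some
# versions, but set anyway as an upper bound (records/sec per partition):
spark.streaming.kafka.maxRatePerPartition      10000
```

The batch interval itself is set in code when the StreamingContext is created, e.g. Minutes(1).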