Hi, I have a weird problem when processing high volumes of data through my 3-node, 3-topic, 4-partition Kafka cluster. I am not sure exactly when the issue starts, but basically, after processing lots of events with no issues (100M or so over about a 5-hour span), the Kafka consumers stop receiving new messages, even though the lag of each consumer is about 55M events.
They sit idle for a few minutes (15 or so), then suddenly start getting events again and process about 50-100K of them (I can see the lag decrease correspondingly), and then stop again. This goes on and on.

I took a jstack trace and can see that all four consumers (one per partition) are in the WAITING state, which I believe corresponds to them sitting inside this loop:

    ConsumerIterator<byte[], byte[]> iter = kafkaStream.iterator();
    while (iter.hasNext()) {
        // process event and send into another topic:
        producer.send(resultJsonEvent.getBytes());
        ....
    }

All events are small (just a few bytes, less than 1K), so there is no chance of the consumer's fetch size being smaller than the producer's message size - which is also confirmed by the fact that the consumers do wake up after a few minutes and process the next, very small, batch of events.

Restarting the consumers sometimes helps (they then run without pausing for many hours) - and sometimes it doesn't, and they keep following this pause/resume pattern...

I should point out that the consumers I am having this issue with also act as producers, sending converted events to another topic, while other consumers that do not produce to Kafka do not have this issue. So this gave me the idea that it might be related to some weird combination of Kafka consumer/producer configuration...

Any idea what else I could check to troubleshoot this? Thanks a lot!
Marina
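P.S. In case it's relevant: these are the producer settings I plan to double-check first, since (if I read the 0.8 producer config docs correctly) they control whether an async producer blocks the calling thread when its internal send queue fills up. The values below are just the documented defaults, not my actual config:

```properties
# Kafka 0.8 producer settings that govern blocking on send
# (names from the 0.8 producer config docs; values are the defaults)
producer.type=async
# max events buffered in the async queue before send() blocks or drops
queue.buffering.max.messages=10000
# -1 = block indefinitely when the queue is full; 0 = drop immediately
queue.enqueue.timeout.ms=-1
# how many messages are batched per broker request
batch.num.messages=200
```

If queue.enqueue.timeout.ms is -1 and the queue is full, producer.send() would block inside my consumer loop, which might explain the WAITING threads - but that is just my guess at this point.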