Hi, I have a weird problem when processing high volumes of data through my 
3-node, 3 topic, 4 partition Kafka cluster.
I am not fully sure at which point the issue starts happening, but basically, 
after some time of processing lots of events (100M or so in about a 5-hour 
span) with no issues, the Kafka consumers stop receiving new messages, even 
though the lag of each consumer is about 55M events.

They sit idle for a few minutes (15 or so), then suddenly start getting events 
again and process about 50-100K events (I can see the lag decrease 
correspondingly), and then stop again. And this goes on and on.

I took a jstack trace and can see that all four consumers (one per partition) 
are in the WAITING state, which I believe corresponds to them sitting inside 
this loop:

    ConsumerIterator<byte[], byte[]> iter = kafkaStream.iterator();
    while (iter.hasNext()) {
        // process event and send into another topic:
        producer.send(resultJsonEvent.getBytes());
        ....
    }
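One thing I am considering trying is to stop calling send() inline in the consumer loop and instead hand events off to a separate producer thread through a bounded buffer, so that a slow or blocked producer cannot stall the fetcher thread. Below is just a minimal sketch of that idea with stand-ins (the class, the `run` helper, and the counter are all hypothetical; a real version would call `producer.send(event)` in the draining thread):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DecoupledPipeline {

    // Moves n simulated events from a "consumer" loop to a "producer" thread
    // through a bounded buffer; returns how many events the producer side handled.
    static int run(int n) {
        BlockingQueue<byte[]> handoff = new ArrayBlockingQueue<>(1024);
        AtomicInteger sent = new AtomicInteger();

        // Stand-in for the producer-to-another-topic thread; a real version
        // would call producer.send(event) instead of incrementing a counter.
        Thread producerThread = new Thread(() -> {
            try {
                while (true) {
                    byte[] event = handoff.poll(500, TimeUnit.MILLISECONDS);
                    if (event == null) break; // demo only: stop when the stream dries up
                    sent.incrementAndGet();   // real code: producer.send(event)
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producerThread.start();

        try {
            // Stand-in for the Kafka consumer loop: hand the event off instead of
            // sending inline, so a blocked producer cannot stall the fetcher.
            for (int i = 0; i < n; i++) {
                handoff.put(("event-" + i).getBytes()); // blocks only if the buffer is full
            }
            producerThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sent.get();
    }

    public static void main(String[] args) {
        System.out.println("handed off and sent " + run(1000) + " events");
    }
}
```

The point of the bounded queue is that the consumer thread only blocks when the buffer is genuinely full, instead of on every single send.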
All events are small (just a few bytes, less than 1K), so there is no chance 
of the consumer's fetch size being smaller than the producer's, which is also 
confirmed by the fact that the consumers do wake up after a few minutes and 
process the next, very small, batch of events.
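For completeness, these are the high-level consumer fetch/timeout settings I am double-checking (assuming the 0.8-era high-level consumer, given the ConsumerIterator usage; the values shown are the defaults as I understand them, not my actual config):

```properties
# consumer.properties (0.8 high-level consumer; defaults as I understand them)
fetch.message.max.bytes=1048576    # max bytes fetched per topic-partition per request
fetch.min.bytes=1                  # broker answers as soon as this much data is available
fetch.wait.max.ms=100              # max time the broker blocks when fetch.min.bytes is not met
consumer.timeout.ms=-1             # -1 = iterator blocks indefinitely when no new messages
zookeeper.session.timeout.ms=6000  # pauses longer than this can trigger rebalances
rebalance.max.retries=4            # failed rebalances can leave a consumer without partitions
```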
Restarting the consumers sometimes helps (they then work without pausing for 
many hours) and sometimes does not - they keep following this pause/resume 
pattern....
I have to point out that the consumers I am having this issue with also act 
as producers, sending converted events to another topic.... And other 
consumers that do not produce to Kafka do not have this issue. So this gave 
me the idea that it might be related to some weird combination of Kafka 
consumer/producer configuration...
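On that theory, these are the 0.8-era producer settings I plan to review, since an async producer whose internal queue fills up with queue.enqueue.timeout.ms=-1 will block the calling thread inside send() - which, from the consumer thread's point of view, might look exactly like this stall (values shown are the defaults as I understand them, not my actual config):

```properties
# producer.properties (0.8 producer; defaults as I understand them)
producer.type=sync                 # async hands send() to a background thread + queue
queue.buffering.max.messages=10000 # async only: max events buffered in the internal queue
queue.buffering.max.ms=5000        # async only: max time events sit in the queue
queue.enqueue.timeout.ms=-1        # async only: -1 means send() BLOCKS when the queue is full
request.required.acks=0            # how many broker acks each produce request waits for
request.timeout.ms=10000           # how long the producer waits for those acks
```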
Any idea what else I could check to troubleshoot this?
thanks a lot!
Marina
