Hi Ashish, Gordon (in CC) might be able to help you.
Cheers, Fabian

2017-11-05 16:24 GMT+01:00 Ashish Pokharel <ashish...@yahoo.com>:

> All,
>
> I am starting to notice strange behavior in a particular streaming app.
> I initially thought it was a Producer issue, as I was seeing timeout
> exceptions (records expiring in queue). I did try to modify
> request.timeout.ms, linger.ms, etc. to help with the issue in case it was
> caused by a sudden burst of data or something along those lines. However,
> that only caused the app to build up more back pressure and get slower and
> slower until the timeout was reached. With lower timeouts, the app would
> actually raise an exception and recover faster. I can tell it is not related
> to connectivity, as other apps are running just fine around the same time
> frame, connected to the same brokers (we have at least 10 streaming apps
> connected to the same list of brokers) from the same data nodes. We have
> enabled Graphite Reporter in all of our applications. After deep diving
> into some of the consumer and producer stats, I noticed that the consumer
> fetch-rate drops tremendously while fetch-size grows exponentially BEFORE
> the producer actually starts to show higher response-time and lower rates.
> Eventually, I noticed connection resets start to occur and connection
> counts go up momentarily, after which things get back to normal. Data
> producer rates remain constant around that timeframe - we have a Logstash
> producer sending data over. We checked both Logstash and Kafka metrics and
> they seem to be showing the same pattern (sort of a sine wave) throughout.
>
> It seems to point to a Kafka issue (perhaps some tuning between the Flink app
> and Kafka), but I wanted to check with the experts before I start knocking
> down Kafka Admin’s doors. Is there anything else I can look into? There
> are quite a few default stats in Graphite, but those were the ones that made
> the most sense.
>
> Thanks, Ashish
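
One way to confirm the ordering described in the quoted message (consumer fetch-rate dropping before producer response-time rises) is to compare when each series first departs from its own baseline. The following is only a minimal Python sketch of that check. The function names, the 0.5 and 1.5 thresholds, and the sample numbers are all hypothetical; they are not taken from the app, from Graphite, or from Kafka.

```python
from typing import Optional, Sequence


def first_departure(series: Sequence[float], baseline_points: int,
                    ratio: float) -> Optional[int]:
    """Return the index of the first sample that departs from the baseline.

    The baseline is the mean of the first `baseline_points` samples.
    ratio < 1 looks for a drop below baseline * ratio;
    ratio > 1 looks for a rise above baseline * ratio.
    Returns None if there is not enough data or no departure is found.
    """
    if baseline_points <= 0 or len(series) <= baseline_points:
        return None
    baseline = sum(series[:baseline_points]) / baseline_points
    threshold = baseline * ratio
    for i in range(baseline_points, len(series)):
        if ratio < 1 and series[i] < threshold:
            return i
        if ratio > 1 and series[i] > threshold:
            return i
    return None


def consumer_leads_producer(fetch_rate: Sequence[float],
                            response_time: Sequence[float],
                            baseline_points: int = 3) -> bool:
    """True if the fetch-rate drop shows up before the response-time rise."""
    drop = first_departure(fetch_rate, baseline_points, 0.5)      # assumed threshold
    rise = first_departure(response_time, baseline_points, 1.5)   # assumed threshold
    if drop is None or rise is None:
        return False
    return drop < rise


# Made-up samples on a shared time axis, for illustration only.
fetch_rate = [100, 98, 102, 99, 40, 20, 15, 90]
response_time = [10, 11, 9, 10, 10, 12, 30, 11]
print(consumer_leads_producer(fetch_rate, response_time))
```

Both series would need to be exported on the same time axis and resolution for the index comparison to mean anything, and the thresholds would have to be tuned to the real metrics.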