My team recently had an issue that looks like it might be caused by
backpressure from Flume, and I am having trouble finding information related to it.
We have a single source that writes to three memory channels (one for each
DC we are sending to) and multiple sinks writing to the Kafka clusters in each
DC. We had a connectivity issue where two of our DCs were no longer able to
send to the Kafka cluster in the third DC, which caused that memory channel to
fill up. It appears that this one full channel has either created backpressure
on the process sending data into Flume or slowed down Flume's ability to
read data in.
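To make the suspected mechanism concrete, here is a toy Python model of the fan-out, not Flume code. It assumes all three channels are "required", which is the default for Flume's replicating channel selector (the selector also accepts a `selector.optional` list). In that case a single full channel rejects the put and the error propagates back to the source. The class and function names and the capacity of 5 are hypothetical, chosen only for illustration.

```python
from collections import deque


class ChannelFullError(Exception):
    """Toy stand-in for Flume's ChannelFullException (hypothetical model)."""


class BoundedChannel:
    """A fixed-capacity FIFO, loosely modelling a memory channel's `capacity`."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        if len(self.events) >= self.capacity:
            raise ChannelFullError(self.name)
        self.events.append(event)

    def take(self):
        return self.events.popleft() if self.events else None


def replicate(event, required, optional=()):
    """Copy one event to every channel, as a replicating selector does."""
    for ch in required:
        ch.put(event)  # a full required channel raises back to the source
    for ch in optional:
        try:
            ch.put(event)
        except ChannelFullError:
            pass  # a full optional channel only drops its own copy


def run(ticks, required, optional=(), stalled=None):
    """Offer one event per tick and return how many the source got accepted."""
    accepted = 0
    for tick in range(ticks):
        try:
            replicate(tick, required, optional)
            accepted += 1
        except ChannelFullError:
            pass  # the source has to back off and retry
        # Sinks drain every channel except the one whose DC is unreachable.
        for ch in (*required, *optional):
            if ch is not stalled:
                ch.take()
    return accepted


def scenario(third_dc_optional):
    # Hypothetical channel names; dc3 plays the DC that became unreachable.
    dc1, dc2, dc3 = (BoundedChannel(f"dc{i}", capacity=5) for i in (1, 2, 3))
    if third_dc_optional:
        return run(100, required=[dc1, dc2], optional=[dc3], stalled=dc3)
    return run(100, required=[dc1, dc2, dc3], stalled=dc3)


print(scenario(third_dc_optional=False))  # 5
print(scenario(third_dc_optional=True))   # 100
```

In the model, once the stalled channel fills, the source stops getting events accepted even though the other two channels are still draining. That pattern would be consistent with the throughput drop described below, though the real agent's behavior (memory channel `keep-alive` waits, per-channel transactions) is more involved than this sketch.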
When our problem first started, it looked like it might have been related to
boxes that had lower memory and were using a significant amount of swap,
but we were seeing similar issues on other boxes that never used any
swap. Looking back at some of our Grafana data from during the issue,
EventsReceived appears to be in the hundreds rather than in the 20k-40k range.
My team is using Flume 1.8.0.