Hey folks, we're running into an odd issue with MirrorMaker and the fetch
request purgatory on the brokers. Our setup consists of two six-node
clusters (all running 0.8.2.1 on identical hardware with the same config).
All "normal" producing and consuming happens on cluster A. MirrorMaker has
been set up to copy all topics (except a tiny blacklist) from cluster A to
cluster B.
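
For context, the mirroring is driven by the stock MirrorMaker tool,
roughly along these lines (the paths, blacklist pattern, and stream count
below are placeholders rather than our exact values):

    # Sketch of the MirrorMaker invocation on 0.8.2.1. consumer.properties
    # points at cluster A's ZooKeeper; producer.properties points at
    # cluster B's brokers. All values here are placeholders.
    bin/kafka-mirror-maker.sh \
      --consumer.config /etc/kafka/mirrormaker-consumer.properties \
      --producer.config /etc/kafka/mirrormaker-producer.properties \
      --blacklist 'topic-a|topic-b' \
      --num.streams 4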

Cluster A is completely healthy at the moment. Cluster B is not, which is
very odd since it is literally handling the exact same traffic.

The graph for Fetch Request Purgatory Size looks like this:
https://www.dropbox.com/s/k87wyhzo40h8gnk/Screenshot%202015-04-09%2016.08.37.png?dl=0

Every time the purgatory shrinks, the resulting latency causes one or more
nodes to drop their leadership (they recover quickly). We could probably
alleviate the symptoms by decreasing
`fetch.purgatory.purge.interval.requests` (it is currently at the default
value), but I'd rather try to understand and solve the root cause here.
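
If we did end up decreasing it, the change would just be a broker-side
setting in server.properties on the cluster B brokers, something like the
line below (the value is purely illustrative, not a tuned recommendation):

    # server.properties on the cluster B brokers -- the value here is an
    # illustration only, not a recommendation
    fetch.purgatory.purge.interval.requests=100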

Cluster B is handling no outside fetch requests, and turning MirrorMaker
off "fixes" the problem, so clearly (since MirrorMaker is producing to this
cluster, not consuming from it) the fetch requests must be coming from
internal replication. However, the same data is also replicated when it is
originally produced in cluster A, and the fetch purgatory size sits stably
at ~10k there. There is nothing unusual in the logs on either cluster.
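
For anyone who wants to compare, the raw purgatory size can be pulled
straight from broker JMX with something like the JmxTool invocation below
(the MBean name and the JMX port are my best guess for 0.8.2, so treat
them as assumptions):

    # Poll the fetch purgatory size on one broker every 10s. The MBean
    # name and JMX port (9999) are assumptions -- adjust to your setup.
    bin/kafka-run-class.sh kafka.tools.JmxTool \
      --jmx-url service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi \
      --object-name 'kafka.server:type=FetchRequestPurgatory,name=PurgatorySize' \
      --reporting-interval 10000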

I have read all the wiki pages and JIRA tickets I can find about the new
purgatory design in 0.8.2, but nothing stands out as applicable. I'm happy
to provide more detailed logs, configuration, etc. if anyone thinks there
might be something important in there. I am completely baffled as to what
could be causing this.

Any suggestions would be appreciated. I'm starting to think at this point
that we've completely misunderstood or misconfigured *something*.

Thanks,
Evan
