Hey Evan,

Is this issue only observed when mirror maker is consuming? It looks like Cluster A also has some other consumers. Do you mean that if you stop mirror maker the problem goes away?
Jiangjie (Becket) Qin

On 4/14/15, 6:55 AM, "Evan Huus" <evan.h...@shopify.com> wrote:

>Any ideas on this? It's still occurring...
>
>Is there a separate mailing list or project for mirrormaker that I could
>ask?
>
>Thanks,
>Evan
>
>On Thu, Apr 9, 2015 at 4:36 PM, Evan Huus <evan.h...@shopify.com> wrote:
>
>> Hey Folks, we're running into an odd issue with mirrormaker and the
>> fetch request purgatory on the brokers. Our setup consists of two
>> six-node clusters (all running 0.8.2.1 on identical hw with the same
>> config). All "normal" producing and consuming happens on cluster A.
>> Mirrormaker has been set up to copy all topics (except a tiny
>> blacklist) from cluster A to cluster B.
>>
>> Cluster A is completely healthy at the moment. Cluster B is not, which
>> is very odd since it is literally handling the exact same traffic.
>>
>> The graph for Fetch Request Purgatory Size looks like this:
>>
>> https://www.dropbox.com/s/k87wyhzo40h8gnk/Screenshot%202015-04-09%2016.08.37.png?dl=0
>>
>> Every time the purgatory shrinks, the latency from that causes one or
>> more nodes to drop their leadership (it quickly recovers). We could
>> probably alleviate the symptoms by decreasing
>> `fetch.purgatory.purge.interval.requests` (it is currently at the
>> default value) but I'd rather try and understand/solve the root cause
>> here.
>>
>> Cluster B is handling no outside fetch requests, and turning
>> mirrormaker off "fixes" the problem, so clearly (since mirrormaker is
>> producing to this cluster not consuming from it) the fetch requests
>> must be coming from internal replication. However, the same data is
>> being replicated when it is originally produced in cluster A, and the
>> fetch purgatory size sits stably at ~10k there. There is nothing
>> unusual in the logs on either cluster.
>>
>> I have read all the wiki pages and jira tickets I can find about the
>> new purgatory design in 0.8.2 but nothing stands out as applicable.
>> I'm happy to provide more detailed logs, configuration, etc. if anyone
>> thinks there might be something important in there. I am completely
>> baffled as to what could be causing this.
>>
>> Any suggestions would be appreciated. I'm starting to think at this
>> point that we've completely misunderstood or misconfigured
>> *something*.
>>
>> Thanks,
>> Evan
>>