On Tue, Apr 14, 2015 at 8:31 PM, Jiangjie Qin <j...@linkedin.com.invalid> wrote:
> Hey Evan,
>
> Is this issue only observed when mirror maker is consuming? It looks that
> for Cluster A you have some other consumers.
> Do you mean if you stop mirror maker the problem goes away?
>

Yes, exactly. The setup is A -> Mirrormaker -> B so mirrormaker is consuming
from A and producing to B. Cluster A is always fine. Cluster B is fine when
mirrormaker is stopped. Cluster B has the weird purgatory issue when
mirrormaker is running.

Today I rolled out a change to reduce the `fetch.purgatory.purge.interval.requests`
and `producer.purgatory.purge.interval.requests` configuration values on
cluster B from 1000 to 200, but it had no effect, which I find really weird.

Thanks,
Evan

> Jiangjie (Becket) Qin
>
> On 4/14/15, 6:55 AM, "Evan Huus" <evan.h...@shopify.com> wrote:
>
> >Any ideas on this? It's still occurring...
> >
> >Is there a separate mailing list or project for mirrormaker that I could
> >ask?
> >
> >Thanks,
> >Evan
> >
> >On Thu, Apr 9, 2015 at 4:36 PM, Evan Huus <evan.h...@shopify.com> wrote:
> >
> >> Hey Folks, we're running into an odd issue with mirrormaker and the fetch
> >> request purgatory on the brokers. Our setup consists of two six-node
> >> clusters (all running 0.8.2.1 on identical hw with the same config). All
> >> "normal" producing and consuming happens on cluster A. Mirrormaker has been
> >> set up to copy all topics (except a tiny blacklist) from cluster A to
> >> cluster B.
> >>
> >> Cluster A is completely healthy at the moment. Cluster B is not, which is
> >> very odd since it is literally handling the exact same traffic.
> >>
> >> The graph for Fetch Request Purgatory Size looks like this:
> >>
> >> https://www.dropbox.com/s/k87wyhzo40h8gnk/Screenshot%202015-04-09%2016.08.37.png?dl=0
> >>
> >> Every time the purgatory shrinks, the latency from that causes one or more
> >> nodes to drop their leadership (it quickly recovers). We could probably
> >> alleviate the symptoms by decreasing
> >> `fetch.purgatory.purge.interval.requests` (it is currently at the default
> >> value) but I'd rather try and understand/solve the root cause here.
> >>
> >> Cluster B is handling no outside fetch requests, and turning mirrormaker
> >> off "fixes" the problem, so clearly (since mirrormaker is producing to this
> >> cluster not consuming from it) the fetch requests must be coming from
> >> internal replication. However, the same data is being replicated when it is
> >> originally produced in cluster A, and the fetch purgatory size sits stably
> >> at ~10k there. There is nothing unusual in the logs on either cluster.
> >>
> >> I have read all the wiki pages and jira tickets I can find about the new
> >> purgatory design in 0.8.2 but nothing stands out as applicable. I'm happy
> >> to provide more detailed logs, configuration, etc. if anyone thinks there
> >> might be something important in there. I am completely baffled as to what
> >> could be causing this.
> >>
> >> Any suggestions would be appreciated. I'm starting to think at this point
> >> that we've completely misunderstood or misconfigured *something*.
> >>
> >> Thanks,
> >> Evan
> >>
> >
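For anyone following along: the mirroring described above is the stock
MirrorMaker tool shipped with 0.8.2, consuming from cluster A and producing
to cluster B. The invocation below is only a rough sketch of that kind of
setup; the config file names, the blacklist pattern, and the stream count
are illustrative placeholders, not the actual settings from this deployment.

    # Sketch of an 0.8.2-era MirrorMaker run: the consumer config points at
    # cluster A, the producer config points at cluster B. File names, the
    # blacklist regex, and --num.streams are placeholders.
    bin/kafka-run-class.sh kafka.tools.MirrorMaker \
        --consumer.config consumer-cluster-a.properties \
        --producer.config producer-cluster-b.properties \
        --blacklist 'some-internal-topics.*' \
        --num.streams 4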
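The purge-interval experiment mentioned above amounts to the two broker
settings below in each cluster B broker's server.properties. Both default
to 1000 in 0.8.2; as described in the thread, lowering them to 200 had no
visible effect on the purgatory growth.

    # Broker settings on cluster B; defaults are 1000 in 0.8.2.
    # Lowering them to 200 did not change the observed behaviour.
    fetch.purgatory.purge.interval.requests=200
    producer.purgatory.purge.interval.requests=200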