Hello,

Hope everyone is doing well. I was hoping to get some assistance with a strange issue we're experiencing while using MirrorMaker to pull data from an 8-node Kafka cluster in AWS down into our data center. Both Kafka clusters and the mirror are running version 0.8.1.1, and each cluster has its own dedicated ZooKeeper ensemble (running 3.4.5).
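For reference, the mirror is started with the stock MirrorMaker tool that ships with 0.8.1.1, roughly like this (a sketch only - the properties file names are placeholders, the whitelist is just the topic from the debug log below, and the stream/producer counts are the startup settings listed at the end):

  bin/kafka-run-class.sh kafka.tools.MirrorMaker \
    --consumer.config mirror-consumer.properties \
    --producer.config mirror-producer.properties \
    --whitelist 'MessageHeadersBody' \
    --num.streams 2 --num.producers 4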
The problem we're seeing is that the mirror starts up and begins consuming a specific topic from the cluster. It correctly attaches to all 24 partitions for that topic, but invariably a set of partitions either doesn't get read at all or is read at a very slow rate. Those partitions are always associated with the same brokers - for example, all of the stuck partitions sit on broker 2, or on brokers 2 and 4. On restarting the mirror, these 'stuck' partitions may stay the same or move; if they move, their backlog is drained very quickly. If we add more mirrors for additional capacity, the same thing happens, except that each mirror has its own set of stuck partitions. I've included the mirror's configuration below along with samples from the logs.

1) The partition issue seems to happen when the mirror first starts up. Once in a blue moon it reads from everything normally, but on restart it can easily get back into this state.

2) We're fairly sure it isn't a processing/throughput issue. We can turn the mirror off for a while, incur a large backlog of data, and when it is re-enabled it chews through the data very quickly, minus the handful of stuck partitions.

3) We've looked at both the ZooKeeper and broker logs and there doesn't seem to be anything out of the ordinary. We see the mirror connecting, there are a few info messages about ZooKeeper nodes already existing, etc. No specific errors.

4) We've enabled debug logging on the mirror and noticed that during the ZooKeeper heartbeat/offset updates, messages like the following are missing for the 'stuck' partitions:

[2015-09-08 18:38:12,157] DEBUG Reading reply sessionid:0x14f956bd57d21ee, packet:: clientPath:null serverPath:null finished:false header:: 357,5 replyHeader:: 357,8597251893,0 request:: '/consumers/mirror-kafkablk-kafka-gold-east-to-kafkablk-den/offsets/MessageHeadersBody/5,#34303537353838,-1 response:: s{4295371756,8597251893,1439969185754,1441759092134,19500,0,0,0,7,0,4295371756} (org.apache.zookeeper.ClientCnxn)

i.e. we see this message for all of the partitions that are processing, but never for the stuck ones. There are no errors in the log prior to this, and once in a great while we might see a log entry for one of the stuck partitions.

5) We've checked latency/response time to ZooKeeper from both the brokers and the mirror and it appears fine.

Mirror consumer config:

group.id=mirror-kafkablk-kafka-gold-east-to-kafkablk-den
consumer.id=mirror-kafkablk-mirror00-den-kafka-gold-east-to-kafkablk-den
zookeeper.connect=zk.strange.dev.net:2181
fetch.message.max.bytes=15728640
socket.receive.buffer.bytes=64000000
socket.timeout.ms=60000
zookeeper.connection.timeout.ms=60000
zookeeper.session.timeout.ms=30000
zookeeper.sync.time.ms=4000
auto.offset.reset=smallest
auto.commit.interval.ms=20000

Mirror producer config:

client.id=mirror-kafkablk-mirror00-den-kafka-gold-east-to-kafkablk-den
metadata.broker.list=kafka00.lan.strange.dev.net:9092,kafka01.lan.strange.dev.net:9092,kafka02.lan.strange.dev.net:9092,kafka03.lan.strange.dev.net:9092,kafka04.lan.strange.dev.net:9092
request.required.acks=1
producer.type=async
request.timeout.ms=20000
retry.backoff.ms=1000
message.send.max.retries=6
serializer.class=kafka.serializer.DefaultEncoder
send.buffer.bytes=134217728
compression.codec=gzip

Mirror startup settings:

--num.streams 2 --num.producers 4

Any thoughts/suggestions would be very helpful. At this point we're running out of things to try.

Craig J. Swift
Software Engineer
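P.S. In case it's useful, this is roughly the kind of check we've been running to see where the stuck partitions live and whether their offsets are being committed. It's only a sketch using the standard tools bundled with 0.8.1.1; the group/topic/ZooKeeper names are the ones from the configs and log above, and the partition number is just the one that appears in the debug output:

  # Which broker leads each partition of the topic
  # (the stuck partitions always map to the same leaders)
  bin/kafka-topics.sh --describe \
    --zookeeper zk.strange.dev.net:2181 \
    --topic MessageHeadersBody

  # Committed offset and lag per partition for the mirror's consumer group
  bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker \
    --zkconnect zk.strange.dev.net:2181 \
    --group mirror-kafkablk-kafka-gold-east-to-kafkablk-den \
    --topic MessageHeadersBody

  # The same committed offset can be read directly from the ZooKeeper CLI, e.g.
  #   get /consumers/mirror-kafkablk-kafka-gold-east-to-kafkablk-den/offsets/MessageHeadersBody/5

For the stuck partitions the lag keeps growing while the offset never moves, which matches the missing offset-update messages in the ZooKeeper debug output above.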