I'm running a Kafka cluster on 3 EC2 instances. Each instance runs Kafka (0.11.0.1) and ZooKeeper (3.4). Each of my topics is configured with 20 partitions and a replication factor of 3.
Today I noticed that some partitions refuse to sync to all three nodes. Here's an example:

bin/kafka-topics.sh --zookeeper "10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181" --describe --topic prod-decline

Topic:prod-titan-decline PartitionCount:20 ReplicationFactor:3 Configs:
    Topic: prod-decline Partition: 0 Leader: 2 Replicas: 1,2,0 Isr: 2
    Topic: prod-decline Partition: 1 Leader: 2 Replicas: 2,0,1 Isr: 2
    Topic: prod-decline Partition: 2 Leader: 0 Replicas: 0,1,2 Isr: 2,0,1
    Topic: prod-decline Partition: 3 Leader: 1 Replicas: 1,0,2 Isr: 2,0,1
    Topic: prod-decline Partition: 4 Leader: 2 Replicas: 2,1,0 Isr: 2
    Topic: prod-decline Partition: 5 Leader: 2 Replicas: 0,2,1 Isr: 2
    Topic: prod-decline Partition: 6 Leader: 2 Replicas: 1,2,0 Isr: 2
    Topic: prod-decline Partition: 7 Leader: 2 Replicas: 2,0,1 Isr: 2
    Topic: prod-decline Partition: 8 Leader: 0 Replicas: 0,1,2 Isr: 2,0,1
    Topic: prod-decline Partition: 9 Leader: 1 Replicas: 1,0,2 Isr: 2,0,1
    Topic: prod-decline Partition: 10 Leader: 2 Replicas: 2,1,0 Isr: 2
    Topic: prod-decline Partition: 11 Leader: 2 Replicas: 0,2,1 Isr: 2
    Topic: prod-decline Partition: 12 Leader: 2 Replicas: 1,2,0 Isr: 2
    Topic: prod-decline Partition: 13 Leader: 2 Replicas: 2,0,1 Isr: 2
    Topic: prod-decline Partition: 14 Leader: 0 Replicas: 0,1,2 Isr: 2,0,1
    Topic: prod-decline Partition: 15 Leader: 1 Replicas: 1,0,2 Isr: 2,0,1
    Topic: prod-decline Partition: 16 Leader: 2 Replicas: 2,1,0 Isr: 2
    Topic: prod-decline Partition: 17 Leader: 2 Replicas: 0,2,1 Isr: 2
    Topic: prod-decline Partition: 18 Leader: 2 Replicas: 1,2,0 Isr: 2
    Topic: prod-decline Partition: 19 Leader: 2 Replicas: 2,0,1 Isr: 2

Only node 2 has all the data in-sync. I've tried restarting brokers 0 and 1 but it didn't improve the situation - it made it even worse. I'm tempted to restart node 2 but I'm assuming it will lead to downtime or cluster failure so I'd like to avoid it if possible.
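
In case it helps with reading output like the listing above, here is a minimal sketch of how the partitions with a shrunken ISR could be picked out. It is only an illustration: the function name under_replicated and the regular expression are my own (not part of Kafka), and it assumes each partition line keeps the "Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ..." layout shown in the describe output.

```python
import re

# Matches one per-partition line of `kafka-topics.sh --describe` output.
# The summary line ("Topic:... PartitionCount:...") does not match and is skipped.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def under_replicated(describe_output):
    """Return (partition, missing_broker_ids) for every partition whose
    ISR is smaller than its assigned replica set."""
    result = []
    for line in describe_output.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # summary line or unrelated text
        replicas = {int(x) for x in m.group("replicas").split(",")}
        isr = {int(x) for x in m.group("isr").split(",") if x}
        missing = sorted(replicas - isr)
        if missing:
            result.append((int(m.group("partition")), missing))
    return result


# Three lines copied from the output in this post.
sample = """\
Topic:prod-titan-decline PartitionCount:20 ReplicationFactor:3 Configs:
    Topic: prod-decline Partition: 0 Leader: 2 Replicas: 1,2,0 Isr: 2
    Topic: prod-decline Partition: 2 Leader: 0 Replicas: 0,1,2 Isr: 2,0,1
"""

print(under_replicated(sample))  # [(0, [0, 1])]
```

Run against the full listing, this would report brokers 0 and 1 as missing from the ISR of every partition led by broker 2.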
I'm not seeing any obvious errors in the logs, so I'm having a hard time figuring out how to debug the situation. Any tips would be greatly appreciated.

Thanks!

--
Nace Oroz