I'm running a Kafka cluster on 3 EC2 instances. Each instance runs Kafka
(0.11.0.1) and ZooKeeper (3.4). Each of my topics is configured with
20 partitions and a replication factor of 3.

Today I noticed that some partitions refuse to sync to all three nodes.
Here's an example:

bin/kafka-topics.sh --zookeeper "10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181" \
    --describe --topic prod-decline
Topic:prod-titan-decline    PartitionCount:20    ReplicationFactor:3    Configs:
    Topic: prod-decline    Partition: 0    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 1    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: prod-decline    Partition: 2    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 3    Leader: 1    Replicas: 1,0,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 4    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: prod-decline    Partition: 5    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: prod-decline    Partition: 6    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 7    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: prod-decline    Partition: 8    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 9    Leader: 1    Replicas: 1,0,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 10    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: prod-decline    Partition: 11    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: prod-decline    Partition: 12    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 13    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: prod-decline    Partition: 14    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 15    Leader: 1    Replicas: 1,0,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 16    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: prod-decline    Partition: 17    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: prod-decline    Partition: 18    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 19    Leader: 2    Replicas: 2,0,1    Isr: 2


Only broker 2 is in sync for every partition: 14 of the 20 partitions list
broker 2 as the leader and as the only ISR member, and just 6 have a full
ISR. I've tried restarting brokers 0 and 1, but that didn't help; it made
things worse. I'm tempted to restart broker 2, but I assume that would cause
downtime or a cluster failure, so I'd like to avoid it if possible.
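
To make the under-replicated partitions easier to spot, here is a minimal
sketch that filters the --describe output down to the partitions whose ISR
is shorter than their replica list. The function name under_replicated is
hypothetical (it is not part of Kafka), and the sketch assumes the output
format shown above, with Partition:, Replicas: and Isr: each followed by
their value on the same line.

```shell
# Hypothetical helper, not part of Kafka: reads `kafka-topics.sh --describe`
# output on stdin and prints every partition whose ISR has fewer members
# than its replica list.
under_replicated() {
  awk '
    {
      part = ""; replicas = ""; isr = ""
      # Pick out the value that follows each label on this line.
      for (i = 1; i < NF; i++) {
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
        if ($i == "Isr:")       isr = $(i + 1)
      }
      if (part == "") next            # header or unrelated line
      nr = split(replicas, r, ",")    # number of assigned replicas
      ni = split(isr, s, ",")         # number of in-sync replicas
      if (ni < nr) print part, "replicas=" replicas, "isr=" isr
    }'
}
```

It can be fed straight from the command above, for example:
bin/kafka-topics.sh --zookeeper "10.0.0.1:2181" --describe --topic prod-decline | under_replicated
As far as I know, kafka-topics.sh also accepts an
--under-replicated-partitions option together with --describe, which should
give a similar list without any extra parsing.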

I'm not seeing any obvious errors in the logs, so I'm having a hard time
figuring out how to debug the situation. Any tips would be greatly
appreciated.

Thanks!

-- 
Nace Oroz
