One thing you need to be aware of is that if a broker goes down, the affected partitions will remain under-replicated until the broker is restarted and catches up again.
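To make "under-replicated" concrete: a partition is in that state whenever its in-sync replica (ISR) list is smaller than its assigned replica list. The following is only a minimal sketch of that check, not a Kafka API. The function name `under_replicated`, the dictionary layout, and the sample topic and broker ids are all hypothetical, and it assumes you have already collected each partition's replica and ISR lists by some other means.

```python
def under_replicated(partitions):
    """Return (topic, partition, missing_broker_ids) for each partition
    whose ISR does not contain every assigned replica."""
    result = []
    for p in partitions:
        # Brokers that are assigned as replicas but are not currently in sync.
        missing = sorted(set(p["replicas"]) - set(p["isr"]))
        if missing:
            result.append((p["topic"], p["partition"], missing))
    return result


# Hypothetical sample data, not taken from a real cluster.
sample = [
    {"topic": "events", "partition": 0, "replicas": [1, 2], "isr": [1, 2]},
    {"topic": "events", "partition": 1, "replicas": [2, 3], "isr": [2]},
]

for topic, partition, missing in under_replicated(sample):
    print(f"{topic}-{partition} is missing in-sync replicas on brokers {missing}")
```

A partition should drop out of this list once the restarted broker has caught up and rejoined the ISR.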
Thanks,

Jun

On Tue, Jan 27, 2015 at 10:59 AM, Dong, John <zunhai.d...@ebay.com> wrote:

> Hi,
>
> I am new to this forum and I am not sure this is the correct mailing list
> for sending questions. If not, please let me know and I will stop.
>
> I am looking for help to resolve a replication issue. Replication stopped
> working a while back.
>
> Kafka environment: Kafka 0.8.1.1, CentOS 6.5, 7 node cluster, default
> replication-factor 2, 10 partitions per topic.
>
> Initially each partition was residing on two different nodes. It had been
> this way for several months and working. Starting two weeks ago, two things
> happened.
>
> 1. One node's disk usage got to 100% and crashed the kafka process. So we
> had to delete some *.log and *.index files and restart the kafka process.
> 2. In another case, some other node's disk usage reached 90%. Someone
> deleted some *.log and *.index files without shutting down the kafka process.
> This caused issues and kafka was unable to restart. I had to delete all
> *.log and *.index files on this node to bring kafka back online.
>
> Now replication is all broken. Most of the partitions have only one leader
> and one replica in ISR, even though replication is set up with two broker ids.
> Whenever I shut down the kafka process on a node, whatever leader is running on
> this node gets moved to another node that is defined in replication.
> After I restart kafka on this node, it never becomes a follower and its
> data directory never gets updated.
>
> I tried the following:
>
> 1. I turned on TRACE/DEBUG level logging for kafka and zookeeper. I did
> not find anything that could help.
> 2. I also tried to manipulate the replication configuration in zookeeper
> using zkCLI.sh, like adding a follower to the ISR list. That did not initiate a
> fetcher process to make it become a follower.
> 3. I also created a new topic, with replication working initially. But as
> soon as I shut down kafka on one of its two nodes, that partition loses one
> replica in ISR and it never comes back. This confirms that it is reproducible.
> 4. I ran the kafka preferred replica election tool to force re-election
> of the leader. That did not do anything. It is like nothing happened to the
> cluster.
> 5. I added num.replica.fetchers=10 to server.properties and restarted
> kafka. That did not do anything.
>
> Does anyone have any experience with this? Or any advice on where to look and
> what the next steps are for troubleshooting? There are only two things
> that I may have to do.
>
> 1. Shut down all kafka and zookeeper processes and restart them. I really do not
> want to go this route unless I have to. I would like to identify the root
> cause and not randomly restart the whole cluster.
> 2. Move all topics to another kafka cluster, and rebuild this one. This will
> be very time consuming and require a lot of changes in the application.
>
> Thanks.
>
> John Dong
>