[jira] [Comment Edited] (KAFKA-3410) Unclean leader election and "Halting because log truncation is not allowed"
[ https://issues.apache.org/jira/browse/KAFKA-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458652#comment-16458652 ] Ashish edited comment on KAFKA-3410 at 4/30/18 4:29 PM: We had a similar issue with 0.10.2.1 version . Is there a fix for this issue in later versions ? The broker process was still running and to fix we had to hard kill the broker and then start it . Cluster is a very low volume with total 5 topics and 10 brokers . This is what we saw: ZK session expired on broker 1 Controller resigned . Broker 1 re-registered itself in ZK All the partitions on the broker shrinked to broker 1 ( for topic __consumer_offsets ) 2018-04-30 03:35:01,018 INFO [cluster.Partition:kafka-scheduler-0] Partition [__consumer_offsets,26] on broker 1: Shrinking ISR for partition [__consumer_offsets,26] from 9,1,5 to 1 2018-04-30 03:35:01,238 INFO [cluster.Partition:kafka-scheduler-0] Partition [__consumer_offsets,26] on broker 1: Cached zkVersion [205] not equal to that in zookeeper, skip updating ISR 2018-04-30 03:35:01,238 INFO [cluster.Partition:kafka-scheduler-0] Partition [__consumer_offsets,36] on broker 1: Shrinking ISR for partition [__consumer_offsets,36] from 10,6,1 to 1 Broker 1 said : 2018-04-30 03:35:19,164 INFO [log.Log:kafka-request-handler-4] Truncating log __consumer_offsets-46 to offset 146800439. 2018-04-30 03:35:20,020 FATAL [server.ReplicaFetcherThread:ReplicaFetcherThread-1-3] [ReplicaFetcherThread-1-3], Exiting because log truncation is not allowed for partition __consumer_offsets-46, Current leader 3's latest offset 146800434 is less than replica 1's latest offset 146800439 018-04-30 03:35:20,112 INFO [server.KafkaServer:Thread-3] [Kafka Server 1], shutting down 2018-04-30 03:35:20,123 INFO [server.KafkaServer:Thread-3] [Kafka Server 1], Starting controlled shutdown 2018-04-30 03:35:26,183 INFO [server.KafkaServer:Thread-3] [Kafka Server 1], Controlled shutdown succeeded 2018-04-30 03:35:26,258 INFO [network.SocketServer:Thread-3] [Socket Server on Broker 1], Shutting down 2018-04-30 03:35:26,802 INFO [network.SocketServer:Thread-3] [Socket Server on Broker 1], Shutdown completed 2018-04-30 03:35:26,813 INFO [server.KafkaRequestHandlerPool:Thread-3] [Kafka Request Handler on Broker 1], shutting down 2018-04-30 03:35:29,088 INFO [coordinator.GroupCoordinator:executor-Heartbeat] [GroupCoordinator 1]: Group TEST_TOPIC_CG with generation 106 is now empty 2018-04-30 03:35:29,121 WARN [coordinator.GroupCoordinator:executor-Heartbeat] [GroupCoordinator 1]: Failed to write empty metadata for group TEST_TOPIC_CG: This is not the correct coordinator for this group. Its was a controlled shutdown , however the broker java process was still running . Does this suggests that leader has a older offset than a follower ? we have : controlled.shutdown.enable=true unclean.leader.election.enable=false was (Author: ashish6785): We had a similar issue with 0.10.2.1 version . Is there a fix for this issue in later versions ? The broker process was still running and to fix we had to hard kill the broker and then start it . Cluster is a very low volume with total 5 topics and 10 brokers . This is what we saw: # ZK session expired on broker 1 # Controller resigned . # Broker 1 re-registered itself in ZK # All the partitions on the broker shrinked to broker 1 ( for topic __consumer_offsets ) 2018-04-30 03:35:01,018 INFO [cluster.Partition:kafka-scheduler-0] Partition [__consumer_offsets,26] on broker 1: Shrinking ISR for partition [__consumer_offsets,26] from 9,1,5 to 1 *2018-04-30 03:35:01,238 INFO [cluster.Partition:kafka-scheduler-0] Partition [__consumer_offsets,26] on broker 1: Cached zkVersion [205] not equal to that in zookeeper, skip updating ISR* 2018-04-30 03:35:01,238 INFO [cluster.Partition:kafka-scheduler-0] Partition [__consumer_offsets,36] on broker 1: Shrinking ISR for partition [__consumer_offsets,36] from 10,6,1 to 1 Broker 1 said : 2018-04-30 03:35:19,164 INFO [log.Log:kafka-request-handler-4] Truncating log __consumer_offsets-46 to offset 146800439. *2018-04-30 03:35:20,020 FATAL [server.ReplicaFetcherThread:ReplicaFetcherThread-1-3] [ReplicaFetcherThread-1-3], Exiting because log truncation is not allowed for partition __consumer_offsets-46, Current leader 3's latest offset 146800434 is less than replica 1's latest offset 146800439* 018-04-30 03:35:20,112 INFO [server.KafkaServer:Thread-3] [Kafka Server 1], shutting down 2018-04-30 03:35:20,123 INFO [server.KafkaServer:Thread-3] [Kafka Server 1], Starting controlled shutdown 2018-04-30 03:35:26,183 INFO [server.KafkaServer:Thread-3] [Kafka Server 1], Controlled shutdown succeeded 2018-04-30 03:35:26,258 INFO [network.SocketServer:Thread-3] [Socket Server on Broker 1], Shutting down 2018-04-30 03:35:26,802 INFO [network.SocketServer:Thread-3] [Socket Server on Broker 1],
[jira] [Commented] (KAFKA-3410) Unclean leader election and "Halting because log truncation is not allowed"
[ https://issues.apache.org/jira/browse/KAFKA-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458652#comment-16458652 ] Ashish commented on KAFKA-3410: --- We had a similar issue with 0.10.2.1 version . Is there a fix for this issue in later versions ? The broker process was still running and to fix we had to hard kill the broker and then start it . Cluster is a very low volume with total 5 topics and 10 brokers . This is what we saw: # ZK session expired on broker 1 # Controller resigned . # Broker 1 re-registered itself in ZK # All the partitions on the broker shrinked to broker 1 ( for topic __consumer_offsets ) 2018-04-30 03:35:01,018 INFO [cluster.Partition:kafka-scheduler-0] Partition [__consumer_offsets,26] on broker 1: Shrinking ISR for partition [__consumer_offsets,26] from 9,1,5 to 1 *2018-04-30 03:35:01,238 INFO [cluster.Partition:kafka-scheduler-0] Partition [__consumer_offsets,26] on broker 1: Cached zkVersion [205] not equal to that in zookeeper, skip updating ISR* 2018-04-30 03:35:01,238 INFO [cluster.Partition:kafka-scheduler-0] Partition [__consumer_offsets,36] on broker 1: Shrinking ISR for partition [__consumer_offsets,36] from 10,6,1 to 1 Broker 1 said : 2018-04-30 03:35:19,164 INFO [log.Log:kafka-request-handler-4] Truncating log __consumer_offsets-46 to offset 146800439. *2018-04-30 03:35:20,020 FATAL [server.ReplicaFetcherThread:ReplicaFetcherThread-1-3] [ReplicaFetcherThread-1-3], Exiting because log truncation is not allowed for partition __consumer_offsets-46, Current leader 3's latest offset 146800434 is less than replica 1's latest offset 146800439* 018-04-30 03:35:20,112 INFO [server.KafkaServer:Thread-3] [Kafka Server 1], shutting down 2018-04-30 03:35:20,123 INFO [server.KafkaServer:Thread-3] [Kafka Server 1], Starting controlled shutdown 2018-04-30 03:35:26,183 INFO [server.KafkaServer:Thread-3] [Kafka Server 1], Controlled shutdown succeeded 2018-04-30 03:35:26,258 INFO [network.SocketServer:Thread-3] [Socket Server on Broker 1], Shutting down 2018-04-30 03:35:26,802 INFO [network.SocketServer:Thread-3] [Socket Server on Broker 1], Shutdown completed 2018-04-30 03:35:26,813 INFO [server.KafkaRequestHandlerPool:Thread-3] [Kafka Request Handler on Broker 1], shutting down 2018-04-30 03:35:29,088 INFO [coordinator.GroupCoordinator:executor-Heartbeat] [GroupCoordinator 1]: Group CPRIME_MASTER_EVENT_PROD_CPRIME_CG with generation 106 is now empty 2018-04-30 03:35:29,121 WARN [coordinator.GroupCoordinator:executor-Heartbeat] [GroupCoordinator 1]: Failed to write empty metadata for group CPRIME_MASTER_EVENT_PROD_CPRIME_CG: This is not the correct coordinator for this group. Its was a controlled shutdown , however the broker java process was still running . Does this suggests that leader has a older offset than a follower ? we have : controlled.shutdown.enable=true unclean.leader.election.enable=false > Unclean leader election and "Halting because log truncation is not allowed" > --- > > Key: KAFKA-3410 > URL: https://issues.apache.org/jira/browse/KAFKA-3410 > Project: Kafka > Issue Type: Bug >Reporter: James Cheng >Priority: Major > Labels: reliability > > I ran into a scenario where one of my brokers would continually shutdown, > with the error message: > [2016-02-25 00:29:39,236] FATAL [ReplicaFetcherThread-0-1], Halting because > log truncation is not allowed for topic test, Current leader 1's latest > offset 0 is less than replica 2's latest offset 151 > (kafka.server.ReplicaFetcherThread) > I managed to reproduce it with the following scenario: > 1. Start broker1, with unclean.leader.election.enable=false > 2. Start broker2, with unclean.leader.election.enable=false > 3. Create topic, single partition, with replication-factor 2. > 4. Write data to the topic. > 5. At this point, both brokers are in the ISR. Broker1 is the partition > leader. > 6. Ctrl-Z on broker2. (Simulates a GC pause or a slow network) Broker2 gets > dropped out of ISR. Broker1 is still the leader. I can still write data to > the partition. > 7. Shutdown Broker1. Hard or controlled, doesn't matter. > 8. rm -rf the log directory of broker1. (This simulates a disk replacement or > full hardware replacement) > 9. Resume broker2. It attempts to connect to broker1, but doesn't succeed > because broker1 is down. At this point, the partition is offline. Can't write > to it. > 10. Resume broker1. Broker1 resumes leadership of the topic. Broker2 attempts > to join ISR, and immediately halts with the error message: > [2016-02-25 00:29:39,236] FATAL [ReplicaFetcherThread-0-1], Halting because > log truncation is not allowed for topic test, Current leader 1's latest > offset 0 is less than replica 2's latest offset 151 >