[jira] [Comment Edited] (KAFKA-3410) Unclean leader election and "Halting because log truncation is not allowed"

2018-04-30 Thread Ashish (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458652#comment-16458652
 ] 

Ashish edited comment on KAFKA-3410 at 4/30/18 4:29 PM:


We hit a similar issue on version 0.10.2.1. Is there a fix for this in later 
versions?

The broker process was still running; to recover we had to hard-kill the broker 
and then start it again.

The cluster is very low volume, with 5 topics and 10 brokers in total.

 

This is what we saw:

 

 1. ZK session expired on broker 1.
 2. The controller resigned.
 3. Broker 1 re-registered itself in ZK.
 4. The ISR for every partition on the broker shrank to just broker 1 (for topic 
__consumer_offsets), e.g.:
2018-04-30 03:35:01,018 INFO [cluster.Partition:kafka-scheduler-0] Partition 
[__consumer_offsets,26] on broker 1: Shrinking ISR for partition 
[__consumer_offsets,26] from 9,1,5 to 1
2018-04-30 03:35:01,238 INFO [cluster.Partition:kafka-scheduler-0] Partition 
[__consumer_offsets,26] on broker 1: Cached zkVersion [205] not equal to that 
in zookeeper, skip updating ISR
2018-04-30 03:35:01,238 INFO [cluster.Partition:kafka-scheduler-0] Partition 
[__consumer_offsets,36] on broker 1: Shrinking ISR for partition 
[__consumer_offsets,36] from 10,6,1 to 1
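The "Cached zkVersion [205] not equal to that in zookeeper, skip updating ISR" 
line is the usual sign that broker 1's view was stale: after the session expiry 
the partition state znode had already been rewritten (bumping its version), so 
broker 1's conditional write with the old cached version was rejected and the 
ISR change was skipped. Roughly, the compare-and-set behaviour looks like the 
sketch below (a toy model, not Kafka's or ZooKeeper's actual code; class and 
variable names are made up, only the version numbers come from the log above):

// Toy sketch of the versioned (compare-and-set) write behind the
// "Cached zkVersion [...] not equal ... skip updating ISR" message.
import java.util.List;

public class IsrUpdateSketch {

    // Minimal stand-in for a ZooKeeper znode: payload plus a version counter.
    static class ZNode {
        List<Integer> isr;
        int version;
        ZNode(List<Integer> isr, int version) { this.isr = isr; this.version = version; }
    }

    // Conditional write: only succeeds when the caller's expected version matches.
    // The real ZooKeeper client would signal BadVersionException instead of returning false.
    static boolean conditionalUpdate(ZNode node, List<Integer> newIsr, int expectedVersion) {
        if (node.version != expectedVersion) {
            return false;
        }
        node.isr = newIsr;
        node.version++; // every successful write bumps the znode version
        return true;
    }

    public static void main(String[] args) {
        // Partition state already rewritten while broker 1 was fenced: version is now 206.
        ZNode partitionState = new ZNode(List.of(9, 1, 5), 206);

        // Broker 1 still holds the pre-expiry cached version 205 and tries to shrink the ISR to {1}.
        int cachedZkVersion = 205;
        if (!conditionalUpdate(partitionState, List.of(1), cachedZkVersion)) {
            System.out.println("Cached zkVersion [" + cachedZkVersion
                    + "] not equal to that in zookeeper, skip updating ISR");
        }
    }
}

Because the broker just skips the update and keeps its stale cache, the message 
can repeat until something refreshes its view (in our case, the hard restart).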

Broker 1 then logged:

2018-04-30 03:35:19,164 INFO [log.Log:kafka-request-handler-4] Truncating log 
__consumer_offsets-46 to offset 146800439.

2018-04-30 03:35:20,020 FATAL 
[server.ReplicaFetcherThread:ReplicaFetcherThread-1-3] 
[ReplicaFetcherThread-1-3], Exiting because log truncation is not allowed for 
partition __consumer_offsets-46, Current leader 3's latest offset 146800434 is 
less than replica 1's latest offset 146800439

2018-04-30 03:35:20,112 INFO [server.KafkaServer:Thread-3] [Kafka Server 1], 
shutting down
2018-04-30 03:35:20,123 INFO [server.KafkaServer:Thread-3] [Kafka Server 1], 
Starting controlled shutdown
2018-04-30 03:35:26,183 INFO [server.KafkaServer:Thread-3] [Kafka Server 1], 
Controlled shutdown succeeded
2018-04-30 03:35:26,258 INFO [network.SocketServer:Thread-3] [Socket Server on 
Broker 1], Shutting down
2018-04-30 03:35:26,802 INFO [network.SocketServer:Thread-3] [Socket Server on 
Broker 1], Shutdown completed
2018-04-30 03:35:26,813 INFO [server.KafkaRequestHandlerPool:Thread-3] [Kafka 
Request Handler on Broker 1], shutting down
2018-04-30 03:35:29,088 INFO [coordinator.GroupCoordinator:executor-Heartbeat] 
[GroupCoordinator 1]: Group TEST_TOPIC_CG with generation 106 is now empty
2018-04-30 03:35:29,121 WARN [coordinator.GroupCoordinator:executor-Heartbeat] 
[GroupCoordinator 1]: Failed to write empty metadata for group TEST_TOPIC_CG: 
This is not the correct coordinator for this group.
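The "not the correct coordinator" warning fits the same picture: a group's 
coordinator is the broker currently leading that group's __consumer_offsets 
partition, and that leadership had just moved away from broker 1. For reference, 
the group-to-partition mapping works roughly like the sketch below (an 
illustration, not Kafka source; the partition count assumes the default 
offsets.topic.num.partitions=50):

// Sketch of how a consumer group maps to a __consumer_offsets partition.
// The broker leading that partition is the group's coordinator; once broker 1
// lost that leadership, its metadata write for the group was rejected.
public class CoordinatorPartitionSketch {
    public static void main(String[] args) {
        String groupId = "TEST_TOPIC_CG";
        int offsetsTopicPartitionCount = 50; // default offsets.topic.num.partitions

        // Non-negative hash of the group id, modulo the partition count.
        int partition = (groupId.hashCode() & 0x7fffffff) % offsetsTopicPartitionCount;

        System.out.println("Group " + groupId + " maps to __consumer_offsets-" + partition);
    }
}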

 

It was a controlled shutdown, but the broker's Java process was still 
running.

Does this suggest that the leader had an older offset than the follower?

We have:

controlled.shutdown.enable=true

unclean.leader.election.enable=false
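From the FATAL message it does look that way: leader 3 reported latest offset 
146800434 while replica 1 was at 146800439, i.e. the follower was ahead of the 
leader it was fetching from. With unclean.leader.election.enable=false the 
replica fetcher refuses to truncate below its own log end offset and halts the 
broker instead of silently dropping those messages. The sketch below shows the 
decision implied by the FATAL above (an illustration using the offsets from the 
log, not Kafka's actual code; the names are made up):

// Illustration of the follower-side safety check behind the FATAL above.
public class TruncationCheckSketch {
    public static void main(String[] args) {
        long leaderLatestOffset  = 146_800_434L; // reported by the current leader (broker 3)
        long replicaLatestOffset = 146_800_439L; // this replica's log end offset (broker 1)
        boolean uncleanLeaderElectionEnabled = false; // unclean.leader.election.enable=false

        if (leaderLatestOffset < replicaLatestOffset) {
            if (!uncleanLeaderElectionEnabled) {
                // Truncating to the leader's offset would discard records this replica
                // already has (and may have acknowledged), so the broker halts instead.
                System.err.println("Exiting because log truncation is not allowed: leader at "
                        + leaderLatestOffset + ", replica at " + replicaLatestOffset);
                System.exit(1);
            } else {
                // With unclean election allowed, the replica would truncate to the
                // leader's offset and resume fetching, accepting the data loss.
                System.out.println("Truncating to " + leaderLatestOffset + " and continuing");
            }
        }
    }
}

So the setting behaved as intended (it refused to drop data the follower already 
had); the part that looks wrong in our case is that the JVM stayed up after the 
FATAL instead of exiting.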


> Unclean leader election and "Halting because log truncation is not allowed"
> ---
>
> Key: KAFKA-3410
> URL: https://issues.apache.org/jira/browse/KAFKA-3410
> Project: Kafka
>  Issue Type: Bug
> Reporter: James Cheng
> Priority: Major
>  Labels: reliability
>
> I ran into a scenario where one of my brokers would continually shutdown, 
> with the error message:
> [2016-02-25 00:29:39,236] FATAL [ReplicaFetcherThread-0-1], Halting because 
> log truncation is not allowed for topic test, Current leader 1's latest 
> offset 0 is less than replica 2's latest offset 151 
> (kafka.server.ReplicaFetcherThread)
> I managed to reproduce it with the following scenario:
> 1. Start broker1, with unclean.leader.election.enable=false
> 2. Start broker2, with unclean.leader.election.enable=false
> 3. Create topic, single partition, with replication-factor 2.
> 4. Write data to the topic.
> 5. At this point, both brokers are in the ISR. Broker1 is the partition 
> leader.
> 6. Ctrl-Z on broker2. (Simulates a GC pause or a slow network) Broker2 gets 
> dropped out of ISR. Broker1 is still the leader. I can still write data to 
> the partition.
> 7. Shutdown Broker1. Hard or controlled, doesn't matter.
> 8. rm -rf the log directory of broker1. (This simulates a disk replacement or 
> full hardware replacement)
> 9. Resume broker2. It attempts to connect to broker1, but doesn't succeed 
> because broker1 is down. At this point, the partition is offline. Can't write 
> to it.
> 10. Resume broker1. Broker1 resumes leadership of the topic. Broker2 attempts 
> to join ISR, and immediately halts with the error message:
> [2016-02-25 00:29:39,236] FATAL [ReplicaFetcherThread-0-1], Halting because 
> log truncation is not allowed for topic test, Current leader 1's latest 
> offset 0 is less than replica 2's latest offset 151 
> (kafka.server.ReplicaFetcherThread)