Kemal ERDEN created KAFKA-7888:
----------------------------------

             Summary: kafka cluster not recovering - Shrinking ISR from 14,13 
to 13 (kafka.cluster.Partition) continuously
                 Key: KAFKA-7888
                 URL: https://issues.apache.org/jira/browse/KAFKA-7888
             Project: Kafka
          Issue Type: Bug
          Components: controller, replication, zkclient
    Affects Versions: 2.1.0
         Environment: using kafka_2.12-2.1.0

3 ZooKeeper nodes and 3 brokers on 3 boxes (1 ZooKeeper node and 1 broker on 
each box), 
default.replication.factor: 2, 
offsets topic replication factor was 1 when the error happened; increased to 2 
after seeing this error by reassigning partitions.
compression: broker left at the default ("producer"); producers send gzip.

Linux (Red Hat), ext4, Kafka logs on a single local disk
            Reporter: Kemal ERDEN
         Attachments: combined.log, producer.log

We are seeing the following log messages repeat on our Kafka cluster from time 
to time. They seem to cause messages to expire on the producers, and the 
cluster goes into a non-recoverable state. The only fix seems to be to restart 
the brokers.


 {{Shrinking ISR from 14,13 to 13 (kafka.cluster.Partition)}}
 {{Cached zkVersion [21] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)}}

 Later on, the following log message is repeated:

{{Got user-level KeeperException when processing sessionid:0xe046aa4f8e60000 
type:setData cxid:0x2df zxid:0xa000001fd txntype:-1 reqpath:n/a Error 
Path:/brokers/topics/ucTrade/partitions/6/state Error:KeeperErrorCode = 
BadVersion for /brokers/topics/ucTrade/partitions/6/state}}
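
For readers unfamiliar with these two messages, here is a minimal sketch of the 
conditional-update behaviour they suggest. It is a toy model, not Kafka's or 
ZooKeeper's actual code: ZooKeeper's setData takes an expected version and 
fails with BadVersion when it does not match the znode's current version. The 
class and function names are hypothetical, and the second writer is an 
assumption made only for illustration.

```python
class BadVersionError(Exception):
    """Stand-in for ZooKeeper's BadVersion KeeperException."""


class VersionedNode:
    """Toy model of a znode: data plus a version bumped on every write."""

    def __init__(self, data):
        self.data = data
        self.version = 0

    def set_data(self, data, expected_version):
        # Conditional write: rejected unless the caller's version is current.
        if expected_version != self.version:
            raise BadVersionError(
                f"expected {expected_version}, node is at {self.version}")
        self.data = data
        self.version += 1
        return self.version


def try_shrink_isr(node, cached_version, new_isr):
    """Attempt an ISR update; return (updated, version the caller keeps)."""
    try:
        return True, node.set_data(new_isr, cached_version)
    except BadVersionError:
        # Mirrors "skip updating ISR": the cached version is left unchanged,
        # so a retry with the same stale version fails again.
        return False, cached_version


state = VersionedNode([14, 13])
cached = state.version        # the leader caches version 0
state.set_data([14, 13], 0)   # hypothetical second writer; node is now at 1
attempts = [try_shrink_isr(state, cached, [13]) for _ in range(3)]
```

In this model every attempt fails for as long as the cached version stays 
stale, which would produce the same repeating pair of messages. Whether that is 
what happens on our brokers is not established by the logs alone.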

We did not interfere with any of the brokers or ZooKeeper nodes while this was 
happening.

I've attached a combined log which represents a combination of the controller, 
server and state change logs from each broker (IDs 13, 14 and 15; the log files 
have the suffixes b13, b14 and b15 respectively).
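
To gauge how often the shrink message repeats in a log such as the attached 
combined.log, a small script along these lines could help. It is only a sketch: 
the regular expression assumes nothing beyond the message fragment quoted 
above, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches only the fragment quoted in this report, e.g.
# "Shrinking ISR from 14,13 to 13".
SHRINK = re.compile(r"Shrinking ISR from ([\d,]+) to ([\d,]+)")


def count_shrinks(lines):
    """Count shrink messages, keyed by (old ISR, new ISR) as strings."""
    counts = Counter()
    for line in lines:
        match = SHRINK.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = [
    "Shrinking ISR from 14,13 to 13 (kafka.cluster.Partition)",
    "Cached zkVersion [21] not equal to that in zookeeper, skip updating ISR "
    "(kafka.cluster.Partition)",
    "Shrinking ISR from 14,13 to 13 (kafka.cluster.Partition)",
]
counts = count_shrinks(sample)
```

Passing an open file handle instead of the sample list gives the same counts 
for a whole log file.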

Since this happened we have increased the heaps from 1g to 6g for the brokers 
and from 512m to 4g for the ZooKeeper nodes, but we are not sure whether that 
is relevant. The ZooKeeper logs have unfortunately been overwritten, so we 
can't provide them.

We produce messages of varying sizes; some are relatively large (6 MB), but we 
use compression on the producers (set to gzip).

I've attached some logs from one of our producers as well.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
