Jurriaan Pruis created KAFKA-6582:
-------------------------------------

             Summary: Partitions get underreplicated, with a single ISR, and 
doesn't recover. Other brokers do not take over and we need to manually restart 
the broker.
                 Key: KAFKA-6582
                 URL: https://issues.apache.org/jira/browse/KAFKA-6582
             Project: Kafka
          Issue Type: Bug
          Components: network
    Affects Versions: 1.0.0
         Environment: Ubuntu 16.04
Linux kafka04 4.4.0-109-generic #132-Ubuntu SMP Tue Jan 9 19:52:39 UTC 2018 
x86_64 x86_64 x86_64 GNU/Linux

java version "9.0.1"
Java(TM) SE Runtime Environment (build 9.0.1+11)
Java HotSpot(TM) 64-Bit Server VM (build 9.0.1+11, mixed mode) 

We also tried the latest JVM 8 earlier, with the same result.
            Reporter: Jurriaan Pruis


Partitions become under-replicated, with a single ISR, and don't recover. Other 
brokers do not take over, and we need to manually restart the 'single ISR' 
broker (if you describe the partitions of a replicated topic, it is clear that 
some partitions are only in sync on this broker).
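
As an illustration of the "describe the partitions" check, here is a minimal sketch that scans describe output for partitions whose ISR has shrunk to a single broker. It assumes the tab-separated per-partition lines printed by `kafka-topics.sh --describe`; the topic name `events`, the sample assignments, and the helper name `single_isr_partitions` are hypothetical and not taken from our cluster.

```python
import re

# Hypothetical sample in the shape printed by `kafka-topics.sh --describe`;
# the topic name and replica assignments are made up for illustration.
SAMPLE = (
    "Topic:events\tPartitionCount:3\tReplicationFactor:3\tConfigs:\n"
    "\tTopic: events\tPartition: 0\tLeader: 3\tReplicas: 3,0,1\tIsr: 3\n"
    "\tTopic: events\tPartition: 1\tLeader: 0\tReplicas: 0,1,3\tIsr: 0,1,3\n"
    "\tTopic: events\tPartition: 2\tLeader: 3\tReplicas: 1,3,0\tIsr: 3\n"
)

# Matches one per-partition line; the topic summary line does not match
# because it contains "PartitionCount:" rather than "Partition:".
LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def single_isr_partitions(describe_output):
    """Return (topic, partition, isr_broker) for every replicated partition
    whose ISR contains exactly one broker."""
    hits = []
    for line in describe_output.splitlines():
        match = LINE.search(line)
        if match is None:
            continue  # summary line or unrelated output
        replicas = match.group("replicas").split(",")
        isr = [broker for broker in match.group("isr").split(",") if broker]
        if len(replicas) > 1 and len(isr) == 1:
            hits.append(
                (match.group("topic"), int(match.group("partition")), int(isr[0]))
            )
    return hits


if __name__ == "__main__":
    for topic, partition, broker in single_isr_partitions(SAMPLE):
        print(f"{topic}-{partition}: only broker {broker} is in sync")
```

If every reported partition names the same broker, that broker is the 'single ISR' broker described above.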

This bug resembles KAFKA-4477 a lot, but since that issue is marked as resolved, 
this is probably a different, though similar, problem.

We have the same issue (or at least it looks pretty similar) on Kafka 1.0. 

We have had these issues since upgrading from Kafka 0.10.2.1 to Kafka 1.0 in 
November 2017.

This happens roughly every 24-48 hours on a random broker, which is why we 
currently have a cron job that restarts every broker every 24 hours.

During this issue, the server log on the ISR broker shows the following:
{code:java}
[2018-02-20 12:02:08,342] WARN Attempting to send response via channel for 
which there is no open connection, connection id 
10.132.0.32:9092-10.14.148.20:56352-96708 (kafka.network.Processor)
[2018-02-20 12:02:08,364] WARN Attempting to send response via channel for 
which there is no open connection, connection id 
10.132.0.32:9092-10.14.150.25:54412-96715 (kafka.network.Processor)
[2018-02-20 12:02:08,349] WARN Attempting to send response via channel for 
which there is no open connection, connection id 
10.132.0.32:9092-10.14.149.18:35182-96705 (kafka.network.Processor)
[2018-02-20 12:02:08,379] WARN Attempting to send response via channel for 
which there is no open connection, connection id 
10.132.0.32:9092-10.14.150.25:54456-96717 (kafka.network.Processor)
[2018-02-20 12:02:08,448] WARN Attempting to send response via channel for 
which there is no open connection, connection id 
10.132.0.32:9092-10.14.159.20:36388-96720 (kafka.network.Processor)
[2018-02-20 12:02:08,683] WARN Attempting to send response via channel for 
which there is no open connection, connection id 
10.132.0.32:9092-10.14.157.110:41922-96740 (kafka.network.Processor)
{code}
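
To see which remote hosts these warnings refer to, the connection ids can be tallied with a small script. This is only a sketch: it assumes the connection id has the form `<local host>:<local port>-<remote host>:<remote port>-<index>` with IPv4 addresses, and the helper name `count_closed_channels` is hypothetical.

```python
import re
from collections import Counter

# Assumption: connection id = <local host>:<local port>-<remote host>:<remote port>-<index>
# (IPv4 addresses only in this sketch).
CONN_ID = re.compile(
    r"connection id\s+(?P<local>[\d.]+:\d+)-"
    r"(?P<remote_host>[\d.]+):(?P<remote_port>\d+)-(?P<index>\d+)"
)

# Three of the entries from the server log above, wrapped as in the email.
SAMPLE_LOG = """\
[2018-02-20 12:02:08,342] WARN Attempting to send response via channel for
which there is no open connection, connection id
10.132.0.32:9092-10.14.148.20:56352-96708 (kafka.network.Processor)
[2018-02-20 12:02:08,364] WARN Attempting to send response via channel for
which there is no open connection, connection id
10.132.0.32:9092-10.14.150.25:54412-96715 (kafka.network.Processor)
[2018-02-20 12:02:08,379] WARN Attempting to send response via channel for
which there is no open connection, connection id
10.132.0.32:9092-10.14.150.25:54456-96717 (kafka.network.Processor)
"""


def count_closed_channels(log_text):
    """Count 'no open connection' warnings per remote host."""
    # Entries may be hard-wrapped across lines, so collapse all whitespace first.
    joined = " ".join(log_text.split())
    return Counter(m.group("remote_host") for m in CONN_ID.finditer(joined))


if __name__ == "__main__":
    for host, count in count_closed_channels(SAMPLE_LOG).most_common():
        print(host, count)
```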
Also on the ISR broker, the controller log shows this:
{code:java}
[2018-02-20 12:02:14,927] INFO [Controller-3-to-broker-3-send-thread]: 
Controller 3 connected to 10.132.0.32:9092 (id: 3 rack: null) for sending state 
change requests (kafka.controller.RequestSendThread)
[2018-02-20 12:02:14,927] INFO [Controller-3-to-broker-0-send-thread]: 
Controller 3 connected to 10.132.0.10:9092 (id: 0 rack: null) for sending state 
change requests (kafka.controller.RequestSendThread)
[2018-02-20 12:02:14,928] INFO [Controller-3-to-broker-1-send-thread]: 
Controller 3 connected to 10.132.0.12:9092 (id: 1 rack: null) for sending state 
change requests (kafka.controller.RequestSendThread){code}
And the non-ISR brokers show this kind of error:
{code:java}
[2018-02-20 12:02:29,204] WARN [ReplicaFetcher replicaId=1, leaderId=3, 
fetcherId=0] Error in fetch to broker 3, request (type=FetchRequest, 
replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, 
fetchData={......................}, isolationLevel=READ_UNCOMMITTED) 
(kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was 
read
 at 
org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:95)
 at 
kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
 at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:205)
 at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:41)
 at 
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:149)
 at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113)
 at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
