[ https://issues.apache.org/jira/browse/KAFKA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexandre Dupriez resolved KAFKA-8815.
--------------------------------------
    Resolution: Not A Problem

System failure. Not related to Kafka.

Kafka broker blocked on I/O primitive
-------------------------------------

                Key: KAFKA-8815
                URL: https://issues.apache.org/jira/browse/KAFKA-8815
            Project: Kafka
         Issue Type: Bug
         Components: core, log
   Affects Versions: 1.1.1
           Reporter: Alexandre Dupriez
           Priority: Major

This JIRA tracks a problem we ran into on a production cluster.

*Scenario*

Cluster of 15 brokers, with an average ingress throughput of ~4 MB/s and egress of ~4 MB/s per broker.
Brokers run on OpenJDK 8 and are configured with a heap size of 1 GB.
There are around 1,000 partition replicas per broker, and load is evenly balanced. Each broker instance is under fair CPU load, but not overloaded (50-60%). G1 is used for garbage collection and doesn't exhibit any pressure, with mostly short young GCs observed and a heap-after-GC usage of 70%.
Replication factor is 3.

*Symptom*

One broker on the cluster suddenly became "unresponsive". Requests from other brokers, Zookeeper and producers/consumers were failing with timeouts. The Kafka process, however, was still alive and doing some background work (truncating logs and rolling segments). This lasted for hours. At some point, several thread dumps were taken at a few minutes' interval. Most of the threads were "blocked". Deadlock was ruled out.
The most suspicious stack is the following:

{code:java}
Thread 7801: (state = BLOCKED)
 - sun.nio.ch.FileChannelImpl.write(java.nio.ByteBuffer) @bci=25, line=202 (Compiled frame)
 - org.apache.kafka.common.record.MemoryRecords.writeFullyTo(java.nio.channels.GatheringByteChannel) @bci=24, line=93 (Compiled frame)
 - org.apache.kafka.common.record.FileRecords.append(org.apache.kafka.common.record.MemoryRecords) @bci=5, line=152 (Compiled frame)
 - kafka.log.LogSegment.append(long, long, long, long, org.apache.kafka.common.record.MemoryRecords) @bci=82, line=136 (Compiled frame)
 - kafka.log.Log.$anonfun$append$2(kafka.log.Log, org.apache.kafka.common.record.MemoryRecords, boolean, boolean, int, java.lang.Object) @bci=1080, line=757 (Compiled frame)
 - kafka.log.Log$$Lambda$614.apply() @bci=24 (Compiled frame)
 - kafka.log.Log.maybeHandleIOException(scala.Function0, scala.Function0) @bci=1, line=1696 (Compiled frame)
 - kafka.log.Log.append(org.apache.kafka.common.record.MemoryRecords, boolean, boolean, int) @bci=29, line=642 (Compiled frame)
 - kafka.log.Log.appendAsLeader(org.apache.kafka.common.record.MemoryRecords, int, boolean) @bci=5, line=612 (Compiled frame)
 - kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(kafka.cluster.Partition, org.apache.kafka.common.record.MemoryRecords, boolean, int) @bci=148, line=609 (Compiled frame)
 - kafka.cluster.Partition$$Lambda$837.apply() @bci=16 (Compiled frame)
 - kafka.utils.CoreUtils$.inLock(java.util.concurrent.locks.Lock, scala.Function0) @bci=7, line=250 (Compiled frame)
 - kafka.utils.CoreUtils$.inReadLock(java.util.concurrent.locks.ReadWriteLock, scala.Function0) @bci=8, line=256 (Compiled frame)
 - kafka.cluster.Partition.appendRecordsToLeader(org.apache.kafka.common.record.MemoryRecords, boolean, int) @bci=16, line=597 (Compiled frame)
 - kafka.server.ReplicaManager.$anonfun$appendToLocalLog$2(kafka.server.ReplicaManager, boolean, boolean, short, scala.Tuple2) @bci=295, line=739 (Compiled frame)
 - kafka.server.ReplicaManager$$Lambda$836.apply(java.lang.Object) @bci=20 (Compiled frame)
 - scala.collection.TraversableLike.$anonfun$map$1(scala.Function1, scala.collection.mutable.Builder, java.lang.Object) @bci=3, line=234 (Compiled frame)
 - scala.collection.TraversableLike$$Lambda$14.apply(java.lang.Object) @bci=9 (Compiled frame)
 - scala.collection.mutable.HashMap.$anonfun$foreach$1(scala.Function1, scala.collection.mutable.DefaultEntry) @bci=16, line=138 (Compiled frame)
 - scala.collection.mutable.HashMap$$Lambda$31.apply(java.lang.Object) @bci=8 (Compiled frame)
 - scala.collection.mutable.HashTable.foreachEntry(scala.Function1) @bci=39, line=236 (Compiled frame)
 - scala.collection.mutable.HashTable.foreachEntry$(scala.collection.mutable.HashTable, scala.Function1) @bci=2, line=229 (Compiled frame)
 - scala.collection.mutable.HashMap.foreachEntry(scala.Function1) @bci=2, line=40 (Compiled frame)
 - scala.collection.mutable.HashMap.foreach(scala.Function1) @bci=7, line=138 (Compiled frame)
 - scala.collection.TraversableLike.map(scala.Function1, scala.collection.generic.CanBuildFrom) @bci=14, line=234 (Compiled frame)
 - scala.collection.TraversableLike.map$(scala.collection.TraversableLike, scala.Function1, scala.collection.generic.CanBuildFrom) @bci=3, line=227 (Compiled frame)
 - scala.collection.AbstractTraversable.map(scala.Function1, scala.collection.generic.CanBuildFrom) @bci=3, line=104 (Compiled frame)
 - kafka.server.ReplicaManager.appendToLocalLog(boolean, boolean, scala.collection.Map, short) @bci=27, line=723 (Compiled frame)
 - kafka.server.ReplicaManager.appendRecords(long, short, boolean, boolean, scala.collection.Map, scala.Function1, scala.Option, scala.Function1) @bci=27, line=465 (Compiled frame)
 - kafka.server.KafkaApis.handleProduceRequest(kafka.network.RequestChannel$Request) @bci=370, line=473 (Compiled frame)
 - kafka.server.KafkaApis.handle(kafka.network.RequestChannel$Request) @bci=30, line=104 (Compiled frame)
 - kafka.server.KafkaRequestHandler.run() @bci=140, line=69 (Compiled frame)
 - java.lang.Thread.run() @bci=11, line=748 (Interpreted frame)
{code}

The number of open sockets and file descriptors had drastically increased on the broker instance:

{code:java}
sh-4.2$ sudo lsof -P -p 7393 | grep REG | wc -l
9681
sh-4.2$ sudo lsof -P -p 7374 | grep ESTABLISHED | wc -l
173390
{code}

Compared to a healthy broker:

{code:java}
sh-4.2$ sudo lsof -P -p 7393 | grep REG | wc -l
9681
sh-4.2$ sudo lsof -P -p 7393 | grep ESTABLISHED | wc -l
402
{code}

There were no errors or warnings in the broker's logs.

I wonder if this type of behaviour has already been encountered - especially blocking indefinitely on an I/O write.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
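For context on why a single stuck write can wedge request handling: in the stack trace above, `FileChannelImpl.write` runs while the partition holds a read lock (`CoreUtils$.inReadLock`), and `FileChannel.write` has no timeout. The shape of that path can be reduced to the following minimal, hypothetical sketch - this is not Kafka's actual code, and `appendUnderReadLock`/`leaderIsrLock` are illustrative names:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the call shape in the stack trace above:
// an append performed while holding a read lock, as in
// Partition.appendRecordsToLeader -> Log.append -> FileRecords.append
// -> FileChannelImpl.write. FileChannel.write is an untimed blocking
// call: if the underlying device stops completing I/O, the thread
// stays inside write() indefinitely while still holding the lock.
public class BlockedAppendSketch {
    private static final ReentrantReadWriteLock leaderIsrLock = new ReentrantReadWriteLock();

    static int appendUnderReadLock(FileChannel segment, ByteBuffer records) throws IOException {
        leaderIsrLock.readLock().lock();
        try {
            // Blocking, untimed write; a hung filesystem never returns from here.
            int written = 0;
            while (records.hasRemaining()) {
                written += segment.write(records);
            }
            return written;
        } finally {
            leaderIsrLock.readLock().unlock();
        }
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("segment", ".log");
        try (FileChannel ch = FileChannel.open(log, StandardOpenOption.WRITE)) {
            int n = appendUnderReadLock(ch, ByteBuffer.wrap("record-batch".getBytes()));
            System.out.println("written=" + n);
        }
        Files.deleteIfExists(log);
    }
}
```

Under this reading, any thread that later needs the corresponding write lock blocks behind the stuck writer, and further threads queue behind that - consistent with the many BLOCKED threads in the dumps even though no deadlock exists.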