[jira] [Commented] (KAFKA-8815) Kafka broker blocked on I/O primitive

2020-05-16 Thread Alexandre Dupriez (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108917#comment-17108917 ]

Alexandre Dupriez commented on KAFKA-8815:
--

Hi [~william_reynolds]. In this case, the problem was an I/O stall: progress was stuck at the block layer, so the kernel froze the file system and prevented any `read`/`write` from completing.
After some time, a kernel panic led to a hard failure of every subsequent read/write.
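From the JVM side, a stall like this shows up as a `write` call that simply never returns. A minimal sketch of how one could make such a stall observable with a watchdog that times the blocking call (the helper name is hypothetical, and this is not something the broker itself does; note also that a thread stuck in uninterruptible disk I/O cannot actually be interrupted):

```java
import java.util.concurrent.*;

public class WriteWatchdog {
    // Hypothetical helper: runs a (possibly blocking) write and reports
    // whether it completed within the given timeout.
    static boolean writeCompletedWithin(Runnable write, long timeoutMs)
            throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> f = pool.submit(write);
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;                  // write returned in time
        } catch (TimeoutException e) {
            return false;                 // write stalled (e.g. frozen FS)
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdownNow();           // best effort only: a thread in
        }                                 // uninterruptible I/O ignores this
    }

    public static void main(String[] args) throws Exception {
        // A no-op "write" completes; a sleep simulating a stall does not.
        System.out.println(writeCompletedWithin(() -> {}, 500));   // true
        System.out.println(writeCompletedWithin(() -> {
            try { Thread.sleep(5_000); } catch (InterruptedException ignored) {}
        }, 200));                                                  // false
    }
}
```

In the incident above such a watchdog could only have flagged the stall, not recovered from it, since the underlying thread stays blocked in the kernel.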

What was your environment (kernel and block device)?

> Kafka broker blocked on I/O primitive
> -
>
> Key: KAFKA-8815
> URL: https://issues.apache.org/jira/browse/KAFKA-8815
> Project: Kafka
>  Issue Type: Bug
>  Components: core, log
>Affects Versions: 1.1.1
>Reporter: Alexandre Dupriez
>Priority: Major
>
> This JIRA is for tracking a problem we ran into on a production cluster.
> *Scenario*
> Cluster of 15 brokers with an average ingress throughput of ~4 MB/s and egress of ~4 MB/s per broker.
> Brokers are running on OpenJDK 8. They are configured with a heap size of 1 GB.
> There are around 1,000 partition replicas per broker. Load is evenly balanced. Each broker instance is under fair CPU load, but not overloaded (50-60%). G1 is used for garbage collection and doesn't exhibit any pressure, with mostly short young GCs observed and a heap-after-GC usage of 70%.
> Replication factor is 3.
> *Symptom*
> One broker on the cluster suddenly became "unresponsive". Other brokers, Zookeeper and producer/consumer requests were failing with timeouts. The Kafka process, however, was still alive and doing some background work (truncating logs and rolling segments). This lasted for hours. At some point, several thread dumps were taken a few minutes apart. Most of the threads were "blocked". Deadlock was ruled out. The most suspicious stack is the following:
> {code:java}
> Thread 7801: (state = BLOCKED)
>  - sun.nio.ch.FileChannelImpl.write(java.nio.ByteBuffer) @bci=25, line=202 (Compiled frame)
>  - org.apache.kafka.common.record.MemoryRecords.writeFullyTo(java.nio.channels.GatheringByteChannel) @bci=24, line=93 (Compiled frame)
>  - org.apache.kafka.common.record.FileRecords.append(org.apache.kafka.common.record.MemoryRecords) @bci=5, line=152 (Compiled frame)
>  - kafka.log.LogSegment.append(long, long, long, long, org.apache.kafka.common.record.MemoryRecords) @bci=82, line=136 (Compiled frame)
>  - kafka.log.Log.$anonfun$append$2(kafka.log.Log, org.apache.kafka.common.record.MemoryRecords, boolean, boolean, int, java.lang.Object) @bci=1080, line=757 (Compiled frame)
>  - kafka.log.Log$$Lambda$614.apply() @bci=24 (Compiled frame)
>  - kafka.log.Log.maybeHandleIOException(scala.Function0, scala.Function0) @bci=1, line=1696 (Compiled frame)
>  - kafka.log.Log.append(org.apache.kafka.common.record.MemoryRecords, boolean, boolean, int) @bci=29, line=642 (Compiled frame)
>  - kafka.log.Log.appendAsLeader(org.apache.kafka.common.record.MemoryRecords, int, boolean) @bci=5, line=612 (Compiled frame)
>  - kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(kafka.cluster.Partition, org.apache.kafka.common.record.MemoryRecords, boolean, int) @bci=148, line=609 (Compiled frame)
>  - kafka.cluster.Partition$$Lambda$837.apply() @bci=16 (Compiled frame)
>  - kafka.utils.CoreUtils$.inLock(java.util.concurrent.locks.Lock, scala.Function0) @bci=7, line=250 (Compiled frame)
>  - kafka.utils.CoreUtils$.inReadLock(java.util.concurrent.locks.ReadWriteLock, scala.Function0) @bci=8, line=256 (Compiled frame)
>  - kafka.cluster.Partition.appendRecordsToLeader(org.apache.kafka.common.record.MemoryRecords, boolean, int) @bci=16, line=597 (Compiled frame)
>  - kafka.server.ReplicaManager.$anonfun$appendToLocalLog$2(kafka.server.ReplicaManager, boolean, boolean, short, scala.Tuple2) @bci=295, line=739 (Compiled frame)
>  - kafka.server.ReplicaManager$$Lambda$836.apply(java.lang.Object) @bci=20 (Compiled frame)
>  - scala.collection.TraversableLike.$anonfun$map$1(scala.Function1, scala.collection.mutable.Builder, java.lang.Object) @bci=3, line=234 (Compiled frame)
>  - scala.collection.TraversableLike$$Lambda$14.apply(java.lang.Object) @bci=9 (Compiled frame)
>  - scala.collection.mutable.HashMap.$anonfun$foreach$1(scala.Function1, scala.collection.mutable.DefaultEntry) @bci=16, line=138 (Compiled frame)
>  - scala.collection.mutable.HashMap$$Lambda$31.apply(java.lang.Object) @bci=8 (Compiled frame)
>  - scala.collection.mutable.HashTable.foreachEntry(scala.Function1) @bci=39, line=236 (Compiled frame)
>  - scala.collection.mutable.HashTable.foreachEntry$(scala.collection.mutable.HashTable, scala.Function1) @bci=2, line=229 (Compiled frame)
>  - 

[jira] [Commented] (KAFKA-8815) Kafka broker blocked on I/O primitive

2019-11-21 Thread William Reynolds (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979742#comment-16979742 ]

William Reynolds commented on KAFKA-8815:
-

Hi Alexandre, I suspect we may have run into this too, but we didn't manage to get dumps like you did. Do you by any chance have the network in/out pattern after the blocking starts? Also, what does the network processor metric (type=SocketServer, name=NetworkProcessorAvgIdlePercent) do after blocking?
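The metric mentioned above is exposed over JMX; assuming the standard `kafka.network` domain for the SocketServer metrics, its MBean name can be built and inspected without a live broker (the class and method names below are illustrative only):

```java
import javax.management.ObjectName;

public class MetricName {
    // Builds the MBean name for the metric quoted above; constructing it
    // also validates the name pattern.
    static ObjectName networkProcessorIdle() throws Exception {
        return new ObjectName(
            "kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent");
    }

    public static void main(String[] args) throws Exception {
        ObjectName n = networkProcessorIdle();
        // Against a running broker with remote JMX enabled, the gauge would
        // be read via JMXConnectorFactory.connect(url)
        //     .getMBeanServerConnection().getAttribute(n, "Value")
        System.out.println(n.getDomain());            // kafka.network
        System.out.println(n.getKeyProperty("name")); // NetworkProcessorAvgIdlePercent
    }
}
```

A value near 0 for this gauge means the network processor threads are saturated, which is consistent with request timeouts while the broker process itself stays alive.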
