[jira] [Commented] (KAFKA-6565) Sudden unresponsiveness from broker + file descriptor leak

2018-02-19 Thread Noam Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369568#comment-16369568
 ] 

Noam Berman commented on KAFKA-6565:


Just for closure: another non-production cluster that was still on 0.10.1.0 
crashed just now, and this was the stack trace:

JVM version is 25.65-b01
Deadlock Detection:

Found one Java-level deadlock:
=============================

"group-metadata-manager-0":
 waiting to lock Monitor@0x7f634c0398b8 (Object@0xa66603c8, a 
java/util/LinkedList),
 which is held by "kafka-request-handler-7"
"kafka-request-handler-7":
 waiting to lock Monitor@0x7f6334062de8 (Object@0xab402e40, a 
kafka/coordinator/GroupMetadata),
 which is held by "group-metadata-manager-0"

Found a total of 1 deadlock.

 

Looks like this is indeed a duplicate bug. Thanks!
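
For context, the dump above is a classic lock-ordering deadlock: 
group-metadata-manager-0 holds the GroupMetadata monitor while waiting for the 
LinkedList monitor, and kafka-request-handler-7 holds the same two monitors in 
the opposite order. The following minimal Scala sketch reproduces that 
acquisition pattern; the class, field and thread names are illustrative 
stand-ins, not Kafka's actual code.

import java.util.LinkedList

object DeadlockSketch {
  // Stand-ins for the two monitors named in the dump above.
  final class GroupMetadata
  private val groupMetadata  = new GroupMetadata          // kafka/coordinator/GroupMetadata
  private val delayedOpQueue = new LinkedList[String]()   // java/util/LinkedList

  def main(args: Array[String]): Unit = {
    // Mirrors "group-metadata-manager-0": takes the GroupMetadata monitor
    // first, then tries to take the LinkedList monitor.
    val metadataManager = new Thread(() => {
      groupMetadata.synchronized {
        Thread.sleep(100)                  // widen the race window
        delayedOpQueue.synchronized { () }
      }
    }, "group-metadata-manager-0")

    // Mirrors "kafka-request-handler-7": same two monitors, opposite order.
    val requestHandler = new Thread(() => {
      delayedOpQueue.synchronized {
        Thread.sleep(100)
        groupMetadata.synchronized { () }
      }
    }, "kafka-request-handler-7")

    metadataManager.start()
    requestHandler.start()
    // Once both threads park, a thread dump (kill -QUIT / jstack) prints a
    // "Found one Java-level deadlock" block like the one above.
  }
}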

> Sudden unresponsiveness from broker + file descriptor leak
> --
>
> Key: KAFKA-6565
> URL: https://issues.apache.org/jira/browse/KAFKA-6565
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: 3 separate clusters running 0.10.1.0, each with 5 nodes + 5 
> ZooKeepers.
> Clients are mostly Java 0.10.2.1, plus a few Node.js clients 
> (https://github.com/oleksiyk/kafka).
> Throughput per broker: ~3.5 MB per second; each broker is the leader of 
> ~2500 partitions.
> Debian GNU/Linux 8.8 (jessie)
> java version "1.8.0_65"
> c4.4xlarge, 1500 GB gp2 disks
>Reporter: Noam Berman
>Priority: Blocker
>
> Running 3 Kafka clusters v0.10.1.0 on AWS, 5 brokers each (+5 ZooKeepers). 
> In the past week, about 2-3 times a day (on all clusters, no specific 
> broker), we encounter a situation in which a single broker stops responding 
> to all requests from clients/other brokers. At that moment all produced 
> messages fail, and other brokers start writing this to the log repeatedly:
> [2018-02-15 17:56:08,949] WARN [ReplicaFetcherThread-0-878], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@772ad40c (kafka.server.ReplicaFetcherThread)
>  java.io.IOException: Connection to 878 was disconnected before the response was read
>  at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
>  at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
>  at scala.Option.foreach(Option.scala:257)
>  at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
>  at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
>  at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
>  at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
>  at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
>  at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
>  at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
>  at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
>  at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
>  at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
>  at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
>  
> Producers (examples):
>  2018-02-13 21:30:51:010 thread=[kafka-producer-network-thread | catalog-index-0] WARN  Got error produce response with correlation id 8147 on topic-partition catalog-1, retrying (49 attempts left). Error: REQUEST_TIMED_OUT
>  2018-02-13 21:31:06:221 thread=[kafka-producer-network-thread | catalog-index-0] WARN  Got error produce response with correlation id 8166 on topic-partition catalog-index-18, retrying (48 attempts left). Error: NOT_LEADER_FOR_PARTITION
>  
> Once this happens, the open file descriptor count on the affected broker 
> starts increasing at a high rate until it reaches its maximum.
> There are no relevant INFO logs on the affected broker; we'll try to gather 
> DEBUG logs and attach them to the bug next time it happens.
> The big issue here is that although the broker is unresponsive, it stays in 
> the cluster and its ZooKeeper node isn't cleared, so there is no rebalance 
> whatsoever: all clients keep failing until a manual shutdown occurs.
>  
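
Not part of the report itself, but one way the descriptor growth described 
above could be watched from inside a JVM is via the HotSpot-specific 
UnixOperatingSystemMXBean. The sketch below is illustrative only; the object 
name, the 80% threshold and the 10-second polling interval are assumptions.

import java.lang.management.ManagementFactory
import com.sun.management.UnixOperatingSystemMXBean

object FdWatcher {
  def main(args: Array[String]): Unit = {
    ManagementFactory.getOperatingSystemMXBean match {
      // HotSpot/Unix-only MXBean; the count matches what /proc/<pid>/fd shows.
      case os: UnixOperatingSystemMXBean =>
        while (true) {
          val open = os.getOpenFileDescriptorCount
          val max  = os.getMaxFileDescriptorCount
          // 80% is an arbitrary alert threshold chosen for this sketch.
          if (open > max * 0.8)
            println(s"WARN: $open of $max file descriptors in use")
          Thread.sleep(10000L)
        }
      case _ =>
        println("Open file descriptor count not available on this JVM/OS")
    }
  }
}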



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6565) Sudden unresponsiveness from broker + file descriptor leak

2018-02-19 Thread Noam Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368896#comment-16368896
 ] 

Noam Berman commented on KAFKA-6565:


It didn't happen again, and we have since upgraded to 0.10.2.1. The issue 
seems to be resolved. Thanks!



[jira] [Commented] (KAFKA-6565) Sudden unresponsiveness from broker + file descriptor leak

2018-02-16 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366753#comment-16366753
 ] 

Rajini Sivaram commented on KAFKA-6565:
---

[~berman] Looks like this could be due to the deadlocks KAFKA-4399 or 
KAFKA-3994, which were fixed in 0.10.1.1. To verify, would you be able to 
run `kill -QUIT <broker pid>` on an unresponsive broker when this happens 
again? Thank you!
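
For reference, the same check can also be made programmatically via 
ThreadMXBean, either from inside the broker JVM or through a remote JMX 
connection; the sketch below is illustrative only and not part of Kafka, so a 
plain kill -QUIT / jstack thread dump remains the simplest route.

import java.lang.management.ManagementFactory

object DeadlockCheck {
  def main(args: Array[String]): Unit = {
    val threads = ManagementFactory.getThreadMXBean
    // findDeadlockedThreads() returns null when no deadlock is present.
    Option(threads.findDeadlockedThreads()) match {
      case Some(ids) =>
        // Dump stack and lock info for every thread caught in the cycle.
        threads.getThreadInfo(ids, true, true).foreach(println)
      case None =>
        println("No Java-level deadlock detected")
    }
  }
}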
