[jira] [Commented] (KAFKA-6529) Broker leaks memory and file descriptors after sudden client disconnects

2020-04-02 Thread Graham Campbell (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-6529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074102#comment-17074102
 ] 

Graham Campbell commented on KAFKA-6529:


[~zzccctv] I originally reproduced it in a test environment by running many 
instances of the librdkafka performance test client against a broker to 
trigger the bug. If you're hitting this bug, the broker process's open sockets 
and open file descriptors should grow over time, because sockets are leaked 
when clients disconnect while their requests are still being processed.

_./rdkafka_performance -P -x 100 -b broker:9092 -t test_topic_
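
For anyone trying to confirm the symptom, here is a minimal Linux-only 
monitoring sketch (not part of the original report) that polls the broker 
process's open file descriptor count while the performance clients connect and 
disconnect; the broker PID is assumed to be passed as the first argument:

{code:java}
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

// Counts the entries in /proc/<pid>/fd every 10 seconds; with this bug the
// count climbs steadily as clients disconnect while requests are in flight.
public class BrokerFdWatch {
    public static void main(String[] args) throws Exception {
        Path fdDir = Paths.get("/proc", args[0], "fd"); // args[0] = broker PID
        while (true) {
            long count = 0;
            try (DirectoryStream<Path> fds = Files.newDirectoryStream(fdDir)) {
                for (Path ignored : fds)
                    count++;
            }
            System.out.println("open fds: " + count);
            TimeUnit.SECONDS.sleep(10);
        }
    }
}
{code}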

> Broker leaks memory and file descriptors after sudden client disconnects
> 
>
> Key: KAFKA-6529
> URL: https://issues.apache.org/jira/browse/KAFKA-6529
> Project: Kafka
>  Issue Type: Bug
>  Components: network
>Affects Versions: 0.11.0.2, 1.0.0
>Reporter: Graham Campbell
>Priority: Major
> Fix For: 0.11.0.3, 1.0.1, 1.1.0
>
>
> If a producer forcefully disconnects from a broker while it has staged 
> receives, that connection enters a limbo state where it is no longer 
> processed by the SocketServer.Processor, leaking the file descriptor for the 
> socket and the memory used for the staged receive queue for that connection.
> We noticed this during an upgrade from 0.9.0.2 to 0.11.0.2. Immediately after 
> the rolling restart to upgrade, open file descriptors on the brokers started 
> climbing uncontrollably. In a few cases brokers reached our configured max 
> open files limit of 100k and crashed before we rolled back.
> We tracked this down to a buildup of muted connections in the 
> Selector.closingChannels list. If a client disconnects from the broker with 
> multiple pending produce requests, when the broker attempts to send an ack to 
> the client it receives an IOException because the TCP socket has been closed. 
> This triggers the Selector to close the channel, but because the channel 
> still has pending requests, the Selector adds it to Selector.closingChannels 
> to process those requests. However, because that exception was triggered by 
> trying to send a response, the SocketServer.Processor has marked the channel 
> as muted and will no longer process it at all.
> *Reproduced by:*
> Starting a Kafka broker/cluster
> Client produces several messages and then disconnects abruptly (e.g. 
> _./rdkafka_performance -P -x 100 -b broker:9092 -t test_topic_)
> Broker then leaks the file descriptor previously used for the TCP socket and 
> the memory for the unprocessed messages
> *Proposed solution (which we've implemented internally)*
> Whenever an exception is encountered when writing to a socket in 
> Selector.pollSelectionKeys(...), record that the connection failed a send by 
> adding the KafkaChannel ID to Selector.failedSends. Then re-raise the 
> exception to still trigger the socket disconnection logic. Since every 
> exception raised in this function triggers a disconnect, we also treat any 
> exception while writing to the socket as a failed send.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8508) saslClient failed to initialize properly: it's null.

2020-01-02 Thread Graham Campbell (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007003#comment-17007003
 ] 

Graham Campbell commented on KAFKA-8508:


This specific ArrayIndexOutOfBoundsException is actually due to ZOOKEEPER-2323, 
which is fixed in ZK 3.5.2

> saslClient failed to initialize properly: it's null.
> 
>
> Key: KAFKA-8508
> URL: https://issues.apache.org/jira/browse/KAFKA-8508
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 2.0.1
>Reporter: Caroline Liu
>Priority: Major
>
> After a network issue caused the last ISR to fail connecting to ZooKeeper, 
> the attempt to reconnect failed with an ArrayIndexOutOfBoundsException. 
> {code:java}
> 2019-05-31 15:54:38,823 [zk-session-expiry-handler0-SendThread(zk2-1:2181)] 
> WARN (org.apache.zookeeper.ClientCnxn) - Client session timed out, have not 
> heard from server in 20010ms for sessionid 0x1511b2b1042a
> 2019-05-31 15:54:38,823 [zk-session-expiry-handler0-SendThread(zk2-1:2181)] 
> INFO (org.apache.zookeeper.ClientCnxn) - Client session timed out, have not 
> heard from server in 20010ms for sessionid 0x1511b2b1042a, closing socket 
> connection and attempting reconnect
> 2019-05-31 15:54:39,702 [zk-session-expiry-handler0-SendThread(zk1-2:2181)] 
> INFO (org.apache.zookeeper.client.ZooKeeperSaslClient) - Client will use 
> DIGEST-MD5 as SASL mechanism.
> 2019-05-31 15:54:39,702 [zk-session-expiry-handler0-SendThread(zk1-2:2181)] 
> ERROR (org.apache.zookeeper.client.ZooKeeperSaslClient) - Exception while 
> trying to create SASL client: java.lang.ArrayIndexOutOfBoundsException: 0
> 2019-05-31 15:54:39,702 [zk-session-expiry-handler0-SendThread(zk1-2:2181)] 
> INFO (org.apache.zookeeper.ClientCnxn) - Opening socket connection to server 
> zk1-2/1.3.6.1:2181. Will attempt to SASL-authenticate using Login Context 
> section 'Client'
> 2019-05-31 15:54:39,702 [zk-session-expiry-handler0-SendThread(zk1-2:2181)] 
> INFO (org.apache.zookeeper.ClientCnxn) - Socket connection established to 
> zk1-2/1.3.6.1:2181, initiating session
> 2019-05-31 15:54:39,703 [zk-session-expiry-handler0-SendThread(zk1-2:2181)] 
> INFO (org.apache.zookeeper.ClientCnxn) - Session establishment complete on 
> server zk1-2/1.3.6.1:2181, sessionid = 0x1511b2b1042a, negotiated timeout 
> = 3
> 2019-05-31 15:54:39,703 [zk-session-expiry-handler0-SendThread(zk1-2:2181)] 
> ERROR (org.apache.zookeeper.ClientCnxn) - SASL authentication with Zookeeper 
> Quorum member failed: javax.security.sasl.SaslException: saslClient failed to 
> initialize properly: it's null.{code}
> Kafka was "not live" in zookeeper and had to be manually restarted to recover 
> from this error. It would be better if the last ISR could retry.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-7987) a broker's ZK session may die on transient auth failure

2019-11-05 Thread Graham Campbell (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967790#comment-16967790
 ] 

Graham Campbell commented on KAFKA-7987:


[~junrao] We're starting to see these errors more frequently (I guess our 
network is getting less reliable), so I'm looking at a fix for this. Does 
scheduling a reinitialize() in the ZooKeeperClient after notifying handlers 
seem like a reasonable solution? I don't see a way to get more details from 
the ZK client to tell whether the auth failure was caused by a retriable 
error or not.
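
A rough sketch of that idea (hypothetical names, not the actual 
kafka.zookeeper.ZooKeeperClient code): after notifying the auth-failure 
handlers, schedule a delayed re-initialization of the ZooKeeper session in 
case the failure was transient:

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// On AuthFailed, notify the registered handlers as today, then schedule a
// delayed re-initialization of the ZooKeeper session instead of leaving the
// broker alive but permanently unregistered in ZK.
class AuthFailureRetryWatcher implements Watcher {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Runnable notifyAuthFailureHandlers; // existing handler notification
    private final Runnable reinitialize;              // tears down and recreates the ZK session

    AuthFailureRetryWatcher(Runnable notifyAuthFailureHandlers, Runnable reinitialize) {
        this.notifyAuthFailureHandlers = notifyAuthFailureHandlers;
        this.reinitialize = reinitialize;
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.AuthFailed) {
            notifyAuthFailureHandlers.run();
            // Retry after a delay in case the failure was a transient KDC or network problem.
            scheduler.schedule(reinitialize, 5, TimeUnit.SECONDS);
        }
    }
}
{code}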

> a broker's ZK session may die on transient auth failure
> ---
>
> Key: KAFKA-7987
> URL: https://issues.apache.org/jira/browse/KAFKA-7987
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jun Rao
>Priority: Major
>
> After a transient network issue, we saw the following log in a broker.
> {code:java}
> [23:37:02,102] ERROR SASL authentication with Zookeeper Quorum member failed: 
> javax.security.sasl.SaslException: An error: 
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Server not found in Kerberos database (7))]) occurred when 
> evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client 
> will go to AUTH_FAILED state. (org.apache.zookeeper.ClientCnxn)
> [23:37:02,102] ERROR [ZooKeeperClient] Auth failed. 
> (kafka.zookeeper.ZooKeeperClient)
> {code}
> The network issue prevented the broker from communicating to ZK. The broker's 
> ZK session then expired, but the broker didn't know that yet since it 
> couldn't establish a connection to ZK. When the network was back, the broker 
> tried to establish a connection to ZK, but failed due to an auth failure 
> (likely caused by a transient KDC issue). The current logic just ignores the 
> auth failure without trying to create a new ZK session, so the broker ends up 
> permanently in a state where it is alive but not registered in ZK.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8508) saslClient failed to initialize properly: it's null.

2019-11-04 Thread Graham Campbell (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967135#comment-16967135
 ] 

Graham Campbell commented on KAFKA-8508:


The ArrayIndexOutOfBoundsException is due to mistakenly falling back from the 
Kerberos to the DIGEST-MD5 mechanism: retrieving the Kerberos ticket took too 
long, so the ticket wasn't available to SecurityUtils.java. The root cause of 
that was a transient network issue, the same as in KAFKA-7987.

> saslClient failed to initialize properly: it's null.
> 
>
> Key: KAFKA-8508
> URL: https://issues.apache.org/jira/browse/KAFKA-8508
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 2.0.1
>Reporter: Caroline Liu
>Priority: Major
>
> After a network issue caused the last ISR to fail connecting to ZooKeeper, 
> the attempt to reconnect failed with an ArrayIndexOutOfBoundsException. 
> {code:java}
> 2019-05-31 15:54:38,823 [zk-session-expiry-handler0-SendThread(zk2-1:2181)] 
> WARN (org.apache.zookeeper.ClientCnxn) - Client session timed out, have not 
> heard from server in 20010ms for sessionid 0x1511b2b1042a
> 2019-05-31 15:54:38,823 [zk-session-expiry-handler0-SendThread(zk2-1:2181)] 
> INFO (org.apache.zookeeper.ClientCnxn) - Client session timed out, have not 
> heard from server in 20010ms for sessionid 0x1511b2b1042a, closing socket 
> connection and attempting reconnect
> 2019-05-31 15:54:39,702 [zk-session-expiry-handler0-SendThread(zk1-2:2181)] 
> INFO (org.apache.zookeeper.client.ZooKeeperSaslClient) - Client will use 
> DIGEST-MD5 as SASL mechanism.
> 2019-05-31 15:54:39,702 [zk-session-expiry-handler0-SendThread(zk1-2:2181)] 
> ERROR (org.apache.zookeeper.client.ZooKeeperSaslClient) - Exception while 
> trying to create SASL client: java.lang.ArrayIndexOutOfBoundsException: 0
> 2019-05-31 15:54:39,702 [zk-session-expiry-handler0-SendThread(zk1-2:2181)] 
> INFO (org.apache.zookeeper.ClientCnxn) - Opening socket connection to server 
> zk1-2/1.3.6.1:2181. Will attempt to SASL-authenticate using Login Context 
> section 'Client'
> 2019-05-31 15:54:39,702 [zk-session-expiry-handler0-SendThread(zk1-2:2181)] 
> INFO (org.apache.zookeeper.ClientCnxn) - Socket connection established to 
> zk1-2/1.3.6.1:2181, initiating session
> 2019-05-31 15:54:39,703 [zk-session-expiry-handler0-SendThread(zk1-2:2181)] 
> INFO (org.apache.zookeeper.ClientCnxn) - Session establishment complete on 
> server zk1-2/1.3.6.1:2181, sessionid = 0x1511b2b1042a, negotiated timeout 
> = 3
> 2019-05-31 15:54:39,703 [zk-session-expiry-handler0-SendThread(zk1-2:2181)] 
> ERROR (org.apache.zookeeper.ClientCnxn) - SASL authentication with Zookeeper 
> Quorum member failed: javax.security.sasl.SaslException: saslClient failed to 
> initialize properly: it's null.{code}
> Kafka was "not live" in zookeeper and had to be manually restarted to recover 
> from this error. It would be better if the last ISR could retry.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9021) Broker shutdown during SSL handshake may be handled as handshake failure (Regression)

2019-10-10 Thread Graham Campbell (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949027#comment-16949027
 ] 

Graham Campbell commented on KAFKA-9021:


Issue priority is based on the previous instance in KAFKA-7168.

> Broker shutdown during SSL handshake may be handled as handshake failure 
> (Regression)
> -
>
> Key: KAFKA-9021
> URL: https://issues.apache.org/jira/browse/KAFKA-9021
> Project: Kafka
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.0.1, 2.1.1, 2.2.1, 2.3.1
>Reporter: Graham Campbell
>Priority: Major
>
> [Java 11 throws an 
> SSLProtocolException|http://hg.openjdk.java.net/jdk/jdk11/file/1ddf9a99e4ad/src/java.base/share/classes/sun/security/ssl/Alert.java#l125]
>  when an SSL connection is gracefully closed during handshaking, instead of 
> an SSLException. This breaks the fix for KAFKA-7168, and the client may 
> process the resulting SSLException as a non-retriable handshake failure 
> rather than a retriable I/O exception.
>  
> {code:java}
> org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake 
> failed
> Caused by: javax.net.ssl.SSLProtocolException: Received close_notify during 
> handshake
>  at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:126)
>  at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)
>  at 
> java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:308)
>  at 
> java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:264)
>  at 
> java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:255)
>  at java.base/sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:244)
>  at 
> java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:181)
>  at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:164)
>  at java.base/sun.security.ssl.SSLEngineImpl.decode(SSLEngineImpl.java:672)
>  at 
> java.base/sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:627)
>  at java.base/sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:443)
>  at java.base/sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:422)
>  at java.base/javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:634)
>  at 
> org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:474)
>  at 
> org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:337)
>  at 
> org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:264)
>  at 
> org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
>  at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:489)
>  at org.apache.kafka.common.network.Selector.poll(Selector.java:427)
>  at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:271)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:242)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.position(KafkaConsumer.java:1639)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.position(KafkaConsumer.java:1593)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9021) Broker shutdown during SSL handshake may be handled as handshake failure (Regression)

2019-10-10 Thread Graham Campbell (Jira)
Graham Campbell created KAFKA-9021:
--

 Summary: Broker shutdown during SSL handshake may be handled as 
handshake failure (Regression)
 Key: KAFKA-9021
 URL: https://issues.apache.org/jira/browse/KAFKA-9021
 Project: Kafka
  Issue Type: Bug
  Components: security
Affects Versions: 2.2.1, 2.1.1, 2.0.1, 2.3.1
Reporter: Graham Campbell


[Java 11 throws an 
SSLProtocolException|http://hg.openjdk.java.net/jdk/jdk11/file/1ddf9a99e4ad/src/java.base/share/classes/sun/security/ssl/Alert.java#l125]
 when an SSL connection is gracefully closed during handshaking, instead of an 
SSLException. This breaks the fix for KAFKA-7168, and the client may process 
the resulting SSLException as a non-retriable handshake failure rather than a 
retriable I/O exception.

 
{code:java}
org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
Caused by: javax.net.ssl.SSLProtocolException: Received close_notify during 
handshake
 at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:126)
 at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)
 at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:308)
 at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:264)
 at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:255)
 at java.base/sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:244)
 at 
java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:181)
 at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:164)
 at java.base/sun.security.ssl.SSLEngineImpl.decode(SSLEngineImpl.java:672)
 at java.base/sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:627)
 at java.base/sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:443)
 at java.base/sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:422)
 at java.base/javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:634)
 at 
org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:474)
 at 
org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:337)
 at 
org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:264)
 at org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:125)
 at 
org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:489)
 at org.apache.kafka.common.network.Selector.poll(Selector.java:427)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:271)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:242)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
 at 
org.apache.kafka.clients.consumer.KafkaConsumer.position(KafkaConsumer.java:1639)
 at 
org.apache.kafka.clients.consumer.KafkaConsumer.position(KafkaConsumer.java:1593)
{code}
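
One possible way to classify this case (a sketch only, with hypothetical 
names, not the eventual patch) is to inspect the SSLException before mapping 
it to an authentication failure, and rethrow it as a plain IOException when 
the peer simply closed the connection mid-handshake:

{code:java}
import java.io.IOException;
import javax.net.ssl.SSLException;
import javax.net.ssl.SSLProtocolException;

// If the peer sent close_notify while we were still handshaking, it closed the
// connection gracefully; surface that as a retriable I/O problem rather than a
// fatal SSL authentication failure.
final class HandshakeFailureClassifier {
    private HandshakeFailureClassifier() {
    }

    static void rethrow(SSLException e) throws IOException {
        boolean peerClosedDuringHandshake =
            e instanceof SSLProtocolException
                && e.getMessage() != null
                && e.getMessage().contains("Received close_notify during handshake");
        if (peerClosedDuringHandshake)
            throw new IOException("Connection closed during SSL handshake", e); // retriable
        throw e; // genuine handshake failure: keep treating it as an authentication error
    }
}
{code}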
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-6388) Error while trying to roll a segment that already exists

2018-02-08 Thread Graham Campbell (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357401#comment-16357401
 ] 

Graham Campbell commented on KAFKA-6388:


Most of the recent times we've run into this, it's been on non-compacted 
topics that have been idle for a while (no incoming data for longer than 
retention.ms) and then begin receiving data again. It doesn't happen to every 
replica: with 15 partitions and 3 replicas, sometimes only one or two 
followers hit this for a given topic.

> Error while trying to roll a segment that already exists
> 
>
> Key: KAFKA-6388
> URL: https://issues.apache.org/jira/browse/KAFKA-6388
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: David Hay
>Priority: Blocker
>
> Recreating this issue from KAFKA-654 as we've been hitting it repeatedly in 
> our attempts to get a stable 1.0 cluster running (upgrading from 0.8.2.2).
> After the broker spends 30 minutes or more spewing log messages like this:
> {noformat}
> [2017-12-19 16:44:28,998] INFO Replica loaded for partition 
> screening.save.results.screening.save.results.processor.error-43 with initial 
> high watermark 0 (kafka.cluster.Replica)
> {noformat}
> Eventually, the replica thread throws the error below (also referenced in the 
> original issue).  If I remove that partition from the data directory and 
> bounce the broker, it eventually rebalances (assuming it doesn't hit a 
> different partition with the same error).
> {noformat}
> 2017-12-19 15:16:24,227] WARN Newly rolled segment file 
> 0002.log already exists; deleting it first (kafka.log.Log)
> [2017-12-19 15:16:24,227] WARN Newly rolled segment file 
> 0002.index already exists; deleting it first (kafka.log.Log)
> [2017-12-19 15:16:24,227] WARN Newly rolled segment file 
> 0002.timeindex already exists; deleting it first 
> (kafka.log.Log)
> [2017-12-19 15:16:24,232] INFO [ReplicaFetcherManager on broker 2] Removed 
> fetcher for partitions __consumer_offsets-20 
> (kafka.server.ReplicaFetcherManager)
> [2017-12-19 15:16:24,297] ERROR [ReplicaFetcher replicaId=2, leaderId=1, 
> fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread)
> kafka.common.KafkaException: Error processing data for partition 
> sr.new.sr.new.processor.error-38 offset 2
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:204)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:172)
> at scala.Option.foreach(Option.scala:257)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:172)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:169)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:217)
> at 
> kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:167)
> at 
> kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
> Caused by: kafka.common.KafkaException: Trying to roll a new log segment for 
> topic partition sr.new.sr.new.processor.error-38 with start offset 2 while it 
> already exists.
> at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1338)
> at kafka.log.Log$$anonfun$roll$2.apply(Log.scala:1297)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.roll(Log.scala:1297)
> at kafka.log.Log.kafka$log$Log$$maybeRoll(Log.scala:1284)
> at kafka.log.Log$$anonfun$append$2.apply(Log.scala:710)
> at kafka.log.Log$$anonfun$append$2.apply(Log.scala:624)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.append(Log.scala:624)
> at kafka.log.Log.appendAsFollower(Log.scala:607)
> at 
> kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:102)
> at 
> 

[jira] [Updated] (KAFKA-6529) Broker leaks memory and file descriptors after sudden client disconnects

2018-02-02 Thread Graham Campbell (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Graham Campbell updated KAFKA-6529:
---
Flags: Patch

> Broker leaks memory and file descriptors after sudden client disconnects
> 
>
> Key: KAFKA-6529
> URL: https://issues.apache.org/jira/browse/KAFKA-6529
> Project: Kafka
>  Issue Type: Bug
>  Components: network
>Affects Versions: 1.0.0, 0.11.0.2
>Reporter: Graham Campbell
>Priority: Major
>
> If a producer forcefully disconnects from a broker while it has staged 
> receives, that connection enters a limbo state where it is no longer 
> processed by the SocketServer.Processor, leaking the file descriptor for the 
> socket and the memory used for the staged receive queue for that connection.
> We noticed this during an upgrade from 0.9.0.2 to 0.11.0.2. Immediately after 
> the rolling restart to upgrade, open file descriptors on the brokers started 
> climbing uncontrollably. In a few cases brokers reached our configured max 
> open files limit of 100k and crashed before we rolled back.
> We tracked this down to a buildup of muted connections in the 
> Selector.closingChannels list. If a client disconnects from the broker with 
> multiple pending produce requests, when the broker attempts to send an ack to 
> the client it receives an IOException because the TCP socket has been closed. 
> This triggers the Selector to close the channel, but because the channel 
> still has pending requests, the Selector adds it to Selector.closingChannels 
> to process those requests. However, because that exception was triggered by 
> trying to send a response, the SocketServer.Processor has marked the channel 
> as muted and will no longer process it at all.
> *Reproduced by:*
> Starting a Kafka broker/cluster
> Client produces several messages and then disconnects abruptly (e.g. 
> _./rdkafka_performance -P -x 100 -b broker:9092 -t test_topic_)
> Broker then leaks the file descriptor previously used for the TCP socket and 
> the memory for the unprocessed messages
> *Proposed solution (which we've implemented internally)*
> Whenever an exception is encountered when writing to a socket in 
> Selector.pollSelectionKeys(...), record that the connection failed a send by 
> adding the KafkaChannel ID to Selector.failedSends. Then re-raise the 
> exception to still trigger the socket disconnection logic. Since every 
> exception raised in this function triggers a disconnect, we also treat any 
> exception while writing to the socket as a failed send.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-6529) Broker leaks memory and file descriptors after sudden client disconnects

2018-02-02 Thread Graham Campbell (JIRA)
Graham Campbell created KAFKA-6529:
--

 Summary: Broker leaks memory and file descriptors after sudden 
client disconnects
 Key: KAFKA-6529
 URL: https://issues.apache.org/jira/browse/KAFKA-6529
 Project: Kafka
  Issue Type: Bug
  Components: network
Affects Versions: 0.11.0.2, 1.0.0
Reporter: Graham Campbell


If a producer forcefully disconnects from a broker while it has staged 
receives, that connection enters a limbo state where it is no longer processed 
by the SocketServer.Processor, leaking the file descriptor for the socket and 
the memory used for the staged receive queue for that connection.

We noticed this during an upgrade from 0.9.0.2 to 0.11.0.2. Immediately after 
the rolling restart to upgrade, open file descriptors on the brokers started 
climbing uncontrollably. In a few cases brokers reached our configured max open 
files limit of 100k and crashed before we rolled back.

We tracked this down to a buildup of muted connections in the 
Selector.closingChannels list. If a client disconnects from the broker with 
multiple pending produce requests, when the broker attempts to send an ack to 
the client it receives an IOException because the TCP socket has been closed. 
This triggers the Selector to close the channel, but because the channel still 
has pending requests, the Selector adds it to Selector.closingChannels to 
process those requests. However, because that exception was triggered by 
trying to send a response, the SocketServer.Processor has marked the channel 
as muted and will no longer process it at all.

*Reproduced by:*
Starting a Kafka broker/cluster
Client produces several messages and then disconnects abruptly (e.g. 
_./rdkafka_performance -P -x 100 -b broker:9092 -t test_topic_)
Broker then leaks the file descriptor previously used for the TCP socket and 
the memory for the unprocessed messages

*Proposed solution (which we've implemented internally)*
Whenever an exception is encountered when writing to a socket in 
Selector.pollSelectionKeys(...), record that the connection failed a send by 
adding the KafkaChannel ID to Selector.failedSends. Then re-raise the 
exception to still trigger the socket disconnection logic. Since every 
exception raised in this function triggers a disconnect, we also treat any 
exception while writing to the socket as a failed send.
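
A self-contained toy sketch of the bookkeeping idea (simplified, with 
hypothetical class and method names rather than the actual patch; in the real 
code this would hang off Selector.pollSelectionKeys(...) and 
Selector.failedSends as described above):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of the proposed bookkeeping: an IOException raised while writing a
// response is recorded against the connection id and re-thrown, so a later
// cleanup pass closes the channel even though it has been muted.
class FailedSendTracker {
    private final List<String> failedSends = new ArrayList<>();

    void write(String connectionId, SocketChannel socket, ByteBuffer response) throws IOException {
        try {
            socket.write(response);        // throws IOException if the peer already closed the TCP socket
        } catch (IOException e) {
            failedSends.add(connectionId); // record the failed send so cleanup can close this channel
            throw e;                       // re-raise so the normal disconnection logic still runs
        }
    }

    // Cleanup pass: close every channel that failed a send, even if it still
    // has staged (pending) requests, so neither the socket nor the queue leaks.
    void closeFailedSends(Map<String, SocketChannel> channels) throws IOException {
        for (String id : failedSends) {
            SocketChannel channel = channels.remove(id);
            if (channel != null)
                channel.close();
        }
        failedSends.clear();
    }
}
{code}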



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)