[ https://issues.apache.org/jira/browse/KAFKA-8165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
A. Sophie Blee-Goldman resolved KAFKA-8165.
-------------------------------------------
    Resolution: Fixed

> Streams task causes Out Of Memory after connection issues and store restoration
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-8165
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8165
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.1.0
>         Environment: 3 nodes, 22 topics, 16 partitions per topic, 1 window store, 4 KV stores.
> Kafka Streams application cluster: 3 AWS t2.large instances (8 GB mem), 1 application instance, 2 threads per instance.
> Kafka 2.1, Kafka Streams 2.1.
> Amazon Linux.
> Scala application, on Docker, based on OpenJDK 9.
>            Reporter: Di Campo
>            Priority: Major
>
> We have a Kafka Streams 2.1 application. While the Kafka brokers were stable, the (largely stateful) application consumed ~160 messages per second at a sustained rate for several hours.
> However, it then started having connection issues to the brokers:
> {code:java}
> Connection to node 3 (/172.31.36.118:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient){code}
> It also began showing a lot of these errors:
> {code:java}
> WARN [Consumer clientId=stream-processor-81e1ce17-1765-49f8-9b44-117f983a2d19-StreamThread-2-consumer, groupId=stream-processor] 1 partitions have leader brokers without a matching listener, including [broker-2-health-check-0] (org.apache.kafka.clients.NetworkClient){code}
> In fact, the _health-check_ topic exists on the broker but is not consumed by this topology or used in any way by the Streams application (it is just a broker health check). The errors do not mention any topic that is actually consumed by the topology.
> Some time after these errors (which appeared at a rate of ~24 per second for ~5 minutes), the following logs appeared:
> {code:java}
> [2019-03-27 15:14:47,709] WARN [Consumer clientId=stream-processor-81e1ce17-1765-49f8-9b44-117f983a2d19-StreamThread-1-restore-consumer, groupId=] Connection to node -3 (/ip3:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient){code}
> In between the "Connection could not be established" error messages (first 6 lines of them, then 3), 3 messages like this one slipped in:
> {code:java}
> [2019-03-27 15:14:47,723] WARN Started Restoration of visitorCustomerStore partition 15 total records to be restored 17 (com.divvit.dp.streams.applications.monitors.ConsoleGlobalRestoreListener){code}
> ... one for each different KV store I have (one other KV store does not appear, and a WindowStore also does not appear).
> Then I finally see "Restoration Complete" messages (from a logging ConsoleGlobalRestoreListener, as in the docs) for all of my stores, so it seems it may now be safe to resume processing.
> Three minutes later, some events get processed, and I see an OOM error:
> {code:java}
> java.lang.OutOfMemoryError: GC overhead limit exceeded{code}
> ... so given that the application usually processes for hours under the same circumstances, I'm wondering whether there is a memory leak in the connection resources, or somewhere else in the handling of this scenario.
> Kafka and Kafka Streams 2.1.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
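[Editor's note] The logging restore listener referenced in the report can be sketched roughly as follows. The callback names and parameters mirror Kafka Streams' `org.apache.kafka.streams.processor.StateRestoreListener` interface, but local stand-ins for `StateRestoreListener` and `TopicPartition` are defined here so the sketch compiles without the kafka-streams dependency; `ConsoleGlobalRestoreListener` is an assumed reconstruction of the reporter's class, not its actual source.

```java
// Local stand-in for org.apache.kafka.common.TopicPartition, so this sketch
// compiles without the kafka-streams dependency.
class TopicPartition {
    private final String topic;
    private final int partition;

    TopicPartition(String topic, int partition) {
        this.topic = topic;
        this.partition = partition;
    }

    String topic() { return topic; }
    int partition() { return partition; }
}

// Local stand-in mirroring org.apache.kafka.streams.processor.StateRestoreListener.
interface StateRestoreListener {
    void onRestoreStart(TopicPartition topicPartition, String storeName,
                        long startingOffset, long endingOffset);
    void onBatchRestored(TopicPartition topicPartition, String storeName,
                         long batchEndOffset, long numRestored);
    void onRestoreEnd(TopicPartition topicPartition, String storeName,
                      long totalRestored);
}

// Logging listener in the spirit of the report's ConsoleGlobalRestoreListener:
// it only prints progress, matching the log lines quoted above.
class ConsoleGlobalRestoreListener implements StateRestoreListener {
    @Override
    public void onRestoreStart(TopicPartition tp, String storeName,
                               long startingOffset, long endingOffset) {
        System.out.printf("Started Restoration of %s partition %d total records to be restored %d%n",
                storeName, tp.partition(), endingOffset - startingOffset);
    }

    @Override
    public void onBatchRestored(TopicPartition tp, String storeName,
                                long batchEndOffset, long numRestored) {
        System.out.printf("Restored batch of %d records for %s partition %d%n",
                numRestored, storeName, tp.partition());
    }

    @Override
    public void onRestoreEnd(TopicPartition tp, String storeName, long totalRestored) {
        System.out.printf("Restoration Complete for %s partition %d (%d records)%n",
                storeName, tp.partition(), totalRestored);
    }
}

public class RestoreListenerSketch {
    public static void main(String[] args) {
        StateRestoreListener listener = new ConsoleGlobalRestoreListener();
        TopicPartition tp = new TopicPartition("visitorCustomerStore-changelog", 15);
        listener.onRestoreStart(tp, "visitorCustomerStore", 0L, 17L);
        listener.onRestoreEnd(tp, "visitorCustomerStore", 17L);
    }
}
```

With the real dependency on the classpath, the stand-ins are replaced by the Kafka imports and the listener is registered with `streams.setGlobalStateRestoreListener(new ConsoleGlobalRestoreListener())` before calling `streams.start()`.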
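[Editor's note] For investigating a suspected leak like the one described, a heap dump taken at the moment of the OOM is usually the most useful evidence. The following JVM options are standard HotSpot flags (the heap size and file paths are illustrative placeholders, not values from the report); `-Xlog:gc*` is the unified GC logging syntax available on JDK 9, which the reporter's Docker image uses.

```shell
# Illustrative JVM options for diagnosing "GC overhead limit exceeded":
#  - dump the heap to a file when an OutOfMemoryError is thrown,
#  - write GC activity to a log for offline analysis.
JAVA_OPTS="-Xmx4g \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/log/app/heapdump.hprof \
  -Xlog:gc*:file=/var/log/app/gc.log"
```

The resulting `.hprof` file can then be inspected (e.g. with Eclipse MAT or `jhat`-style tools) to see whether consumer/connection objects accumulate after the restore phase.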