[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2018-05-18 Thread Robin Tweedie (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16480626#comment-16480626
 ] 

Robin Tweedie commented on KAFKA-6199:
--

Reading the 1.1 release notes, I suspect the fix was from KAFKA-6529 :D

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
>Priority: Major
> Fix For: 1.1.0
>
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, 
> histo_live_20171206.txt, histo_live_80.txt, jstack-2017-12-08.scrubbed.out, 
> merge_shortest_paths.png, path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2018-02-19 Thread Robin Tweedie (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369257#comment-16369257
 ] 

Robin Tweedie commented on KAFKA-6199:
--

There's another piece of evidence I've noticed about this single broker: it 
builds up file descriptors in a way that the other brokers don't. I'm not sure 
if this narrows the potential causes much.

On the problem broker, check out the high {{sock}} open files:
{noformat}
$ sudo lsof | awk '{print $5}' | sort | uniq -c | sort -rn
  12201 REG
   7229 IPv6
   1374 sock
    337 FIFO
    264 DIR
    163 CHR
    138
     77 unknown
     54 unix
     13 IPv4
      1 TYPE
      1 pack
{noformat}

If you look at the {{lsof}} output directly, there are lots of lines like this 
(25305 is the Kafka pid):
{noformat}
java  25305 user *105u  sock  0,6  0t0  351061533 can't identify protocol
java  25305 user *111u  sock  0,6  0t0  351219556 can't identify protocol
java  25305 user *131u  sock  0,6  0t0  350831689 can't identify protocol
java  25305 user *134u  sock  0,6  0t0  351001514 can't identify protocol
java  25305 user *136u  sock  0,6  0t0  351410956 can't identify protocol
{noformat}

Compare with a good broker that has an uptime of 76 days (only 65 open {{sock}} 
files):
{noformat}
  11729 REG
   7037 IPv6
    335 FIFO
    264 DIR
    164 CHR
    137
     76 unknown
     65 sock
     54 unix
     14 IPv4
      1 TYPE
      1 pack
{noformat}
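
As an aside, the open file descriptor count is also visible from inside the JVM via the platform MXBeans, so the buildup above could be tracked as a metric without shelling out to {{lsof}}. A minimal standalone sketch (this is not Kafka code; the class name, the 60-second poll interval and the stdout output are only assumptions for illustration):
{code}
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import com.sun.management.UnixOperatingSystemMXBean;

// Polls the JVM's own open file descriptor count (Unix-like platforms only)
// so that descriptor growth like the one above can be graphed over time.
public class FdCountProbe {
    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (!(os instanceof UnixOperatingSystemMXBean)) {
            System.err.println("Open FD counts are only exposed on Unix-like JVMs");
            return;
        }
        UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
        while (true) {
            System.out.printf("open fds: %d / %d%n",
                    unixOs.getOpenFileDescriptorCount(),
                    unixOs.getMaxFileDescriptorCount());
            Thread.sleep(60_000L); // assumed polling interval
        }
    }
}
{code}
To see the broker's own descriptors this would of course have to run inside the broker JVM (for example from a metrics reporter), not as a separate process.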

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
>Priority: Major
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, 
> histo_live_20171206.txt, histo_live_80.txt, jstack-2017-12-08.scrubbed.out, 
> merge_shortest_paths.png, path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2018-02-19 Thread Robin Tweedie (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369241#comment-16369241
 ] 

Robin Tweedie commented on KAFKA-6199:
--

[~omkreddy] did you ever get a chance to look at the jstack output?

We are still restarting this broker daily if anyone else can think of something 
to look at.

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
>Priority: Major
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, 
> histo_live_20171206.txt, histo_live_80.txt, jstack-2017-12-08.scrubbed.out, 
> merge_shortest_paths.png, path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-12-07 Thread Manikumar (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282116#comment-16282116
 ] 

Manikumar commented on KAFKA-6199:
--

[~rt_skyscanner] yes, jstack output

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, 
> histo_live_20171206.txt, histo_live_80.txt, merge_shortest_paths.png, 
> path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-12-07 Thread Robin Tweedie (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282111#comment-16282111
 ] 

Robin Tweedie commented on KAFKA-6199:
--

[~omkreddy] Nothing obvious beyond the warnings I shared further up. I'll have 
another look.

When you say thread dump, just the output of {{jstack PID}}?
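
(For reference, roughly the same information that {{jstack PID}} prints can also be captured from inside the JVM via the ThreadMXBean; a small sketch, with the stdout output being an assumption for illustration:)
{code}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Prints stack traces for all live threads, similar in spirit to `jstack -l PID`.
public class ThreadDumpProbe {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // true, true: include locked monitors and ownable synchronizers.
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            // Note: ThreadInfo.toString() truncates very deep stacks; format
            // info.getStackTrace() yourself if you need every frame.
            System.out.print(info);
        }
    }
}
{code}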

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, 
> histo_live_20171206.txt, histo_live_80.txt, merge_shortest_paths.png, 
> path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-12-07 Thread Manikumar (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282041#comment-16282041
 ] 

Manikumar commented on KAFKA-6199:
--

Are there any exceptions in the broker logs? Can you upload a thread dump of the 
problematic broker?

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, 
> histo_live_20171206.txt, histo_live_80.txt, merge_shortest_paths.png, 
> path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-12-04 Thread Robin Tweedie (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276973#comment-16276973
 ] 

Robin Tweedie commented on KAFKA-6199:
--

I'll try doubling the heap and post another histogram when it gets above 75% 
heap again. Many thanks for the help so far!

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, 
> histo_live_80.txt, merge_shortest_paths.png, path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-12-04 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276935#comment-16276935
 ] 

Ismael Juma commented on KAFKA-6199:


2 GB is a bit low, 6 GB is a typical choice for Kafka.

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, 
> histo_live_80.txt, merge_shortest_paths.png, path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-12-04 Thread Manikumar (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276917#comment-16276917
 ] 

Manikumar commented on KAFKA-6199:
--

The histogram shows a large number (~7000) of KafkaChannel/SocketChannelImpl 
instances. Maybe a large number of clients are connecting, or there is a 
genuine KafkaChannel leak like KAFKA-6185. Depending on the load on that 
broker, you could also try increasing the heap size.

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, 
> histo_live_80.txt, merge_shortest_paths.png, path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-12-04 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276869#comment-16276869
 ] 

Ted Yu commented on KAFKA-6199:
---

Looking at the second histogram, JmxReporter$KafkaMbean doesn't show up near the top.
I agree the real issue is somewhere else.

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, 
> histo_live_80.txt, merge_shortest_paths.png, path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-12-04 Thread Robin Tweedie (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276732#comment-16276732
 ] 

Robin Tweedie commented on KAFKA-6199:
--

[~ijuma] we're running with {{-Xmx2g -Xms2g}} 

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, 
> histo_live_80.txt, merge_shortest_paths.png, path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-12-04 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276523#comment-16276523
 ] 

Ismael Juma commented on KAFKA-6199:


[~rt_skyscanner], what's the broker heap size? The histogram shows about 17k 
byte arrays with a total of 1 GB, which doesn't seem too bad.

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, 
> histo_live_80.txt, merge_shortest_paths.png, path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-12-04 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276514#comment-16276514
 ] 

Ismael Juma commented on KAFKA-6199:


[~yuzhih...@gmail.com], yes, please file a separate JIRA for the mbean issue 
and link it back to this one. It doesn't seem to be the cause of the leak reported here.

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, 
> histo_live_80.txt, merge_shortest_paths.png, path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-12-02 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275818#comment-16275818
 ] 

Ted Yu commented on KAFKA-6199:
---

JmxReporter$KafkaMbean showed up near the top in the histo output.
In JmxReporter#removeAttribute():
{code}
KafkaMbean mbean = this.mbeans.get(mBeanName);
if (mbean != null)
    mbean.removeAttribute(metricName.name());
return mbean;
{code}
Shouldn't mbeans.remove(mBeanName) be called before returning?
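
To make the question concrete, here is a deliberately simplified, self-contained sketch of the pattern being asked about -- a plain map standing in for the mbean registry; the class and method names are hypothetical and this is not the real JmxReporter code:
{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the suspected leak: a registry keyed by mbean name whose
// values are attribute sets. Only the remove-when-empty idea mirrors the
// suggestion above; everything else is illustrative.
public class MbeanRegistrySketch {
    private final Map<String, Set<String>> mbeans = new HashMap<>();

    public void addAttribute(String mBeanName, String attribute) {
        mbeans.computeIfAbsent(mBeanName, k -> new HashSet<>()).add(attribute);
    }

    // Leaky variant: empty attribute sets stay in the map forever.
    public void removeAttributeLeaky(String mBeanName, String attribute) {
        Set<String> attributes = mbeans.get(mBeanName);
        if (attributes != null)
            attributes.remove(attribute);
    }

    // Suggested variant: drop the mbean entry once its last attribute is removed.
    public void removeAttributeAndPrune(String mBeanName, String attribute) {
        Set<String> attributes = mbeans.get(mBeanName);
        if (attributes == null)
            return;
        attributes.remove(attribute);
        if (attributes.isEmpty())
            mbeans.remove(mBeanName);
    }

    public int size() {
        return mbeans.size();
    }
}
{code}
The second variant is the "remove before returning" idea: once the last attribute is gone, the registry entry itself is dropped so empty mbeans cannot accumulate.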

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, histo_live.txt, 
> merge_shortest_paths.png, path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-11-27 Thread Manikumar (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267029#comment-16267029
 ] 

Manikumar commented on KAFKA-6199:
--

[~rt_skyscanner] It should be safe to run this command; I have not seen any issues.
You can run it once you observe high memory usage.

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, merge_shortest_paths.png, 
> path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-11-27 Thread Robin Tweedie (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267002#comment-16267002
 ] 

Robin Tweedie commented on KAFKA-6199:
--

[~omkreddy] Would this be safe to run against production? What impact, if any, 
could I expect on the process?

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, merge_shortest_paths.png, 
> path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-11-27 Thread Manikumar (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266985#comment-16266985
 ] 

Manikumar commented on KAFKA-6199:
--

Can you attach the output of the jmap command?

{quote}jdk/bin/jmap -histo:live PID{quote}
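
(Note that {{-histo:live}} forces a full GC so that only reachable objects are counted, which means a brief pause on the broker. If attaching {{jmap}} from outside is inconvenient, a heap dump of live objects can also be triggered from inside the JVM through the HotSpotDiagnostic MXBean; a minimal sketch, with the dump path chosen arbitrarily for illustration:)
{code}
import java.io.IOException;
import java.lang.management.ManagementFactory;

import com.sun.management.HotSpotDiagnosticMXBean;

// Writes an hprof heap dump, roughly equivalent to
// `jmap -dump:live,format=b,file=...` run against the same process.
public class HeapDumpProbe {
    public static void main(String[] args) throws IOException {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live = true triggers a full GC first so only reachable objects are dumped.
        // dumpHeap fails if the target file already exists.
        diagnostics.dumpHeap("/tmp/broker-heap.hprof", true);
    }
}
{code}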

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, merge_shortest_paths.png, 
> path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-11-27 Thread Robin Tweedie (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266760#comment-16266760
 ] 

Robin Tweedie commented on KAFKA-6199:
--

We tried to prove that this client was the culprit, but we were unable to stop 
the heap growth after making changes to it.

Any thoughts on the heap dump pictures I shared, [~ijuma] [~tedyu]?

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, merge_shortest_paths.png, 
> path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-11-20 Thread Robin Tweedie (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259208#comment-16259208
 ] 

Robin Tweedie commented on KAFKA-6199:
--

I think I found the corresponding "old client" causing this problem this 
morning -- this might help with reproducing the issue. It is logging a similar 
error at around the same rate as the warnings we see on the Kafka broker:

{noformat}
[2017-11-20 10:20:17,495] ERROR kafka:102 Unable to receive data from Kafka
Traceback (most recent call last):
  File "/opt/kafka_offset_manager/venv/lib/python2.7/site-packages/kafka/conn.py", line 99, in _read_bytes
    raise socket.error("Not enough data to read message -- did server kill socket?")
error: Not enough data to read message -- did server kill socket?
{noformat}

We have a Python 2.7 process that checks topic and consumer offsets in order to 
report metrics. It was running {{kafka-python==0.9.3}} (the current version is 
1.3.5). We are going to run some experiments to confirm that this client is the 
culprit of the heap growth.

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, merge_shortest_paths.png, 
> path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-11-19 Thread Robin Tweedie (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16258663#comment-16258663
 ] 

Robin Tweedie commented on KAFKA-6199:
--

I'm going to share as much as I can from Eclipse Memory Analyzer running 
against a heap dump from the broker with its heap close to exhaustion.

From *Leak Suspects* (the second-biggest suspect was LogCleaner, but I think 
that looked similar on the same broker right after a restart):
{noformat}
2,469 instances of "org.apache.kafka.common.network.NetworkSend", loaded by "sun.misc.Launcher$AppClassLoader @ 0x90102770" occupy 820.95 MB (70.46%) bytes.

Keywords
sun.misc.Launcher$AppClassLoader @ 0x90102770
org.apache.kafka.common.network.NetworkSend
{noformat}

The {{NetworkSend}}s appear right at the top level (see the attached 
dominator_tree.png) -- using the "path to GC Roots" tool on individual instances 
(screenshots also attached) does not seem to lead anywhere interesting, just 
more networking code.

The values inside the byte buffers just look like lists of broker hostnames.

I am not sure where it would be useful to go from here. I looked at another heap 
dump taken right after the broker was restarted, and there are very few NetworkSend 
objects. I think I'd see similar data if I compared with a "healthy" broker 
(I haven't done this, but could if it would help).

Any ideas?

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png, dominator_tree.png, merge_shortest_paths.png, 
> path2gc.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-11-13 Thread Robin Tweedie (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249426#comment-16249426
 ] 

Robin Tweedie commented on KAFKA-6199:
--

We tried moving the leadership of every single-partition topic off the broker 
in question on Friday, so we're now less sure that it's any particular topic 
being produced to or consumed from.

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-11-11 Thread Robin Tweedie (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248631#comment-16248631
 ] 

Robin Tweedie commented on KAFKA-6199:
--

[~ijuma] we do have quite a few Samza jobs which are on the 0.8 client. I 
should probably mention that our cluster is pinned to the 0.9 log format for 
this reason. Is there any known issue with older clients regarding this?

[~tedyu] I can't attach the heap dumps as they may contain sensitive data, but 
I would be able to analyse them locally and share screenshots or other details.

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-11-11 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248546#comment-16248546
 ] 

Ted Yu commented on KAFKA-6199:
---

Robin:
Can you attach a heap dump?

Thanks

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6199) Single broker with fast growing heap usage

2017-11-10 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247575#comment-16247575
 ] 

Ismael Juma commented on KAFKA-6199:


Do you have older consumers?

> Single broker with fast growing heap usage
> --
>
> Key: KAFKA-6199
> URL: https://issues.apache.org/jira/browse/KAFKA-6199
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.1
> Environment: Amazon Linux
>Reporter: Robin Tweedie
> Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 
> 2017-11-10 at 11.59.06 AM.png
>
>
> We have a single broker in our cluster of 25 with fast growing heap usage 
> which necessitates us restarting it every 12 hours. If we don't restart the 
> broker, it becomes very slow from long GC pauses and eventually has 
> {{OutOfMemory}} errors.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage 
> percentage on the broker. A "normal" broker in the same cluster stays below 
> 50% (averaged) over the same time period.
> We have taken heap dumps when the broker's heap usage is getting dangerously 
> high, and there are a lot of retained {{NetworkSend}} objects referencing 
> byte buffers.
> We also noticed that the single affected broker logs a lot more of this kind 
> of warning than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open 
> connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log 
> message visualized across all the brokers (to show it happens a bit on other 
> brokers, but not nearly as much as it does on the "bad" broker).
> I can't make the heap dumps public, but would appreciate advice on how to pin 
> down the problem better. We're currently trying to narrow it down to a 
> particular client, but without much success so far.
> Let me know what else I could investigate or share to track down the source 
> of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)