[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2020-03-25 Thread Andrew Olson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066833#comment-17066833
 ] 

Andrew Olson commented on KAFKA-8950:
-

We're currently testing 2.3.1 to see if it resolves this issue.

> KafkaConsumer stops fetching
> 
>
> Key: KAFKA-8950
> URL: https://issues.apache.org/jira/browse/KAFKA-8950
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.3.0
> Environment: linux
>Reporter: Will James
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
>
> We have a KafkaConsumer consuming from a single partition with 
> enable.auto.commit set to true.
> Very occasionally, the consumer goes into a broken state. It returns no 
> records from the broker on every poll, and from most of the Kafka metrics in 
> the consumer it looks like it is fully caught up to the end of the log: we 
> see that we are long polling for the max poll timeout and that there is zero 
> lag. The heartbeat rate also stays unchanged from before the issue begins, so 
> the consumer remains a member of the consumer group.
> However, looking at the __consumer_offsets topic, we can see that the 
> consumer keeps committing the same offset on the auto-commit interval; the 
> offset does not move, and the lag from the broker's perspective continues to 
> increase.
> The issue is only resolved by restarting our application (which restarts the 
> KafkaConsumer instance).
> From a heap dump of an application in this state, I can see that the Fetcher 
> believes there are nodesWithPendingFetchRequests. However, looking at the 
> state of the fetch latency sensor, specifically the fetch rate, the samples 
> were not updated for a long period of time - in fact, precisely the amount of 
> time the problem was occurring in our application, around 50 hours. (We have 
> alerting on other metrics but not the fetch rate, so we didn't notice the 
> problem until a customer complained.)
> In this example, the consumer was processing around 40 messages per second 
> with an average size of about 10 KB, although most of the other occurrences 
> have happened at higher volume (250 messages per second, around 23 KB per 
> message on average).
> I have spent some time investigating the issue on our end and will continue 
> to do so as time allows, but I wanted to raise this as an issue because it 
> may be affecting other people.
> Please let me know if you have any questions or need additional information. 
> I doubt I can provide heap dumps, unfortunately, but I can provide further 
> information as needed.
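
For context, a consumer set up along the lines described above might look like 
the following generic sketch (broker address, topic, and group id are 
placeholders, not the reporter's actual configuration):

{code:java}
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Generic sketch of a consumer as described in the report: auto-commit enabled,
// subscribed to a single-partition topic. All names below are placeholders.
public class SinglePartitionConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-single-partition-topic"));
            while (true) {
                // In the broken state described above, these polls keep returning empty
                // batches even though the broker-side lag keeps growing.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
{code}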





[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-25 Thread Andrew Olson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959954#comment-16959954
 ] 

Andrew Olson commented on KAFKA-8950:
-

Ok, sounds good. Thanks.



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-25 Thread Tom Lee (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959918#comment-16959918
 ] 

Tom Lee commented on KAFKA-8950:


Yeah, sounds very similar. If 2.3.1 doesn't help when it ships, I'd open a new 
ticket and attach a heap dump and a stack trace the next time it gets stuck.



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-25 Thread tony mancill (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959913#comment-16959913
 ] 

tony mancill commented on KAFKA-8950:
-

[~noslowerdna] That sounds exactly like the issue we encountered (I work with 
[~thomaslee]). We rebuilt from the 2.3.0 tag with Will's patch applied and are 
looking forward to 2.3.1.



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-25 Thread Andrew Olson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959891#comment-16959891
 ] 

Andrew Olson commented on KAFKA-8950:
-

Thank you for the detailed and helpful response, [~thomaslee]. The symptom we 
see is that a random partition occasionally (maybe once every month or two) 
gets "stuck" and accumulates lag indefinitely while all other partitions 
continue to be consumed normally, with the only solution being to restart the 
consumer. Our consumer groups are relatively large (about 20 to 80 members) 
with each member reading from about 10 to 25 partitions usually, some high 
volume (up to 5k messages per second) and some low. We don't auto-commit 
offsets.



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-25 Thread Tom Lee (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959813#comment-16959813
 ] 

Tom Lee commented on KAFKA-8950:


This specific bug was new in 2.3.0, yep. IIRC the map that would get into an 
inconsistent state and cause all the trouble was introduced in 2.3.0. Looking 
at 0.10.2.2 specifically, I think the code as implemented would have been fine.

If it's helpful, we've also had various issues with earlier versions of the 
client libs, but more often than not those were mitigated by configuration 
changes, client upgrades, or system-level tuning: sysctls like tcp_retries2 and 
tcp_syn_retries are often set too high, misconfigured NICs can cause packet 
loss, and so on.

Request timeouts for consumers in ~0.10 were extremely high because of some 
sort of coupling with group rebalances, and IIRC this didn't get fixed until 
2.0.0. See 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-266%3A+Fix+consumer+indefinite+blocking+behavior.
Producers had similar but distinct issues with certain configuration options. 
Some of this was more difficult to work around directly without the 2.3.x 
upgrade.

Not to say there are no more issues, but a custom build of 2.3.0 with Will's 
patch has been solid for us so far. By comparison, "vanilla" 2.3.0 would cause 
us trouble maybe once or twice a day.



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-25 Thread Andrew Olson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959782#comment-16959782
 ] 

Andrew Olson commented on KAFKA-8950:
-

Was this confirmed to have been first introduced in 2.3.0? If so, do you know 
the JIRA issue that introduced it?

We may have seen similar behavior in previous Kafka versions (0.10.2.1, 2.2.1), 
so I'm curious whether the bug is truly new or has been around for a while.



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-14 Thread Will James (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951323#comment-16951323
 ] 

Will James commented on KAFKA-8950:
---

PR here: [https://github.com/apache/kafka/pull/7511]

Let me know if you have any thoughts.



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-14 Thread Will James (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951229#comment-16951229
 ] 

Will James commented on KAFKA-8950:
---

Thanks. Yeah, it would be great to submit the PR, assuming I can write a good 
unit test. I'll take a stab at it this afternoon.



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-14 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951187#comment-16951187
 ] 

Rajini Sivaram commented on KAFKA-8950:
---

Thanks, [~thomaslee].

[~mumrah] Are we likely to have another RC for 2.3.1? This is a critical 
defect, so it would be good if we can include it in 2.3.1.



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-14 Thread Tom Lee (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951169#comment-16951169
 ] 

Tom Lee commented on KAFKA-8950:


Sounds good to me, [~rsivaram]; agreed, the fix seems trivial. To be clear, I'm 
happy to submit a patch if Will's otherwise occupied -- just wanted to give him 
first right of refusal. :)

Separately: without knowing exactly what the release schedule looks like, is 
there any possibility of this making a 2.3 point release?



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-14 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951159#comment-16951159
 ] 

Rajini Sivaram commented on KAFKA-8950:
---

We can either move `nodesWithPendingFetchRequests.add()` before the send and 
remove it if client.send throws an exception, or track immediate firing of the 
listener and call `nodesWithPendingFetchRequests.add()` only if required. 
Either would work. The code change should be trivial, but we need to add a unit 
test. I am happy to submit a PR, but wanted to check whether either of you 
wanted to submit it first. [~wtjames]?
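
For illustration only, here is a self-contained sketch of the first option 
(record the pending node before sending, and roll it back if the send itself 
throws). It is not the actual Fetcher change; FakeClient and the node id are 
invented stand-ins:

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Standalone sketch of the "add before send" option. Not the real
// Fetcher/ConsumerNetworkClient API: FakeClient and "node-1" are invented.
public class AddBeforeSendSketch {

    interface FakeClient {
        // Models a send() that returns a completion callback and may throw before returning.
        Runnable send(String nodeId);
    }

    private static final Set<String> nodesWithPendingFetchRequests = ConcurrentHashMap.newKeySet();

    static void sendFetch(FakeClient client, String nodeId) {
        // Record the pending fetch *before* sending, so a listener that fires
        // immediately on this thread always finds the entry it needs to remove.
        nodesWithPendingFetchRequests.add(nodeId);
        try {
            Runnable onResponse = client.send(nodeId);
            onResponse.run();                             // stands in for the response listener
        } catch (RuntimeException e) {
            nodesWithPendingFetchRequests.remove(nodeId); // roll back if the send itself failed
            throw e;
        }
    }

    public static void main(String[] args) {
        FakeClient client = node -> () -> nodesWithPendingFetchRequests.remove(node);
        sendFetch(client, "node-1");
        System.out.println("pending after response: " + nodesWithPendingFetchRequests); // prints []
    }
}
{code}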



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-14 Thread Tom Lee (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951140#comment-16951140
 ] 

Tom Lee commented on KAFKA-8950:


[~rsivaram] Hm, a new flag? Couldn't we just move the 
nodesWithPendingFetchRequests.put() above the call to client.send()? Maybe I'm 
misunderstanding.

I'd defer to [~wtjames] on the PR front, but this is really hurting us -- we're 
also happy to figure it out on our end if folks aren't likely to get this 
addressed soon.



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-14 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951126#comment-16951126
 ] 

Rajini Sivaram commented on KAFKA-8950:
---

The scenario that [~wtjames] described is indeed possible. Presumably neither 
environment has debug logs enabled that would have shown a disconnection being 
processed. I guess this would be easy to fix by adding a flag, since the code 
is synchronized and the same thread needs to have run the listener method. 
Would either of you be interested in submitting a PR?
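
As a rough, standalone illustration of that flag idea (not the actual Fetcher 
code; AlreadyCompleteFuture and the node id below are invented), the sending 
thread could note whether the listener has already fired before marking the 
node as pending:

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy sketch of the flag-based idea. Not the actual Fetcher change.
public class ListenerFiredFlagSketch {

    // Minimal stand-in for an already-completed future that invokes a newly added
    // listener immediately on the adding thread (a real future would also store the
    // listener for later completion).
    static class AlreadyCompleteFuture {
        private boolean completed;
        synchronized void complete() { completed = true; }
        synchronized void addListener(Runnable listener) {
            if (completed) listener.run();
        }
    }

    public static void main(String[] args) {
        Set<String> nodesWithPendingFetchRequests = ConcurrentHashMap.newKeySet();
        String nodeId = "node-1";

        AlreadyCompleteFuture future = new AlreadyCompleteFuture();
        future.complete();                  // e.g. a disconnect processed before the listener was added

        AtomicBoolean listenerAlreadyFired = new AtomicBoolean(false);
        future.addListener(() -> {
            listenerAlreadyFired.set(true); // remember that cleanup already ran on this thread
            nodesWithPendingFetchRequests.remove(nodeId);
        });

        // Only mark the node as pending if the listener has not already completed the request.
        if (!listenerAlreadyFired.get()) {
            nodesWithPendingFetchRequests.add(nodeId);
        }

        System.out.println("pending nodes: " + nodesWithPendingFetchRequests); // prints [] -- nothing gets stuck
    }
}
{code}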



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-14 Thread Tom Lee (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951100#comment-16951100
 ] 

Tom Lee commented on KAFKA-8950:


Oh wait, the RequestFutureCompletionHandlers can be invoked by the heartbeat 
thread too, not just the consumer thread.  I think I'm catching up. :)



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-14 Thread Tom Lee (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951096#comment-16951096
 ] 

Tom Lee commented on KAFKA-8950:


[~rsivaram] plaintext transport here too. Not sure we could provide the full 
thread dump, but for now I'm happy to answer specific questions if you have any 
hunches. I'll see what I can figure out with respect to the thread dump.

[~wtjames] ah, I think I see what you're saying. So for consumer thread C and 
heartbeat/coordinator thread H:

- C @ T=0: creates the RequestFuture and adds it to _unsent_, but does not 
  quite get to the point where we add the listener
- H @ T=1: somehow completes the future from T=0 (e.g. disconnect processing)
- C @ T=2: adds the listener, which is immediately invoked on the calling 
  thread and attempts to remove the nodesWithPendingFetchRequests entry before 
  it has been added
- C @ T=3: adds the nodesWithPendingFetchRequests entry, which will never be 
  removed because the listener has already fired

The only thing I'm not seeing now is how the futures could actually get 
completed/failed directly on H: from what I can see, they'd typically be 
enqueued into _pendingCompletion_ by RequestFutureCompletionHandler and 
processed on the consumer thread rather than being called directly. It would 
only need to happen once, though. Very interesting; it does seem precarious at 
the very least.
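
To make that interleaving concrete, here is a minimal self-contained toy model 
of it (not the actual Fetcher/ConsumerNetworkClient code; ToyRequestFuture and 
the node id are invented for illustration):

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Toy model of the C/H interleaving described above. Not Kafka code.
public class PendingFetchRaceDemo {

    // Tiny completable future that runs a newly added listener immediately if it is
    // already complete, mirroring how an already-completed RequestFuture fires the
    // listener on the calling thread.
    static class ToyRequestFuture {
        private boolean completed;
        private Consumer<String> listener;

        synchronized void complete(String result) {
            completed = true;
            if (listener != null) listener.accept(result);
        }

        synchronized void addListener(Consumer<String> l) {
            listener = l;
            if (completed) l.accept("already-complete"); // fires on the caller's (consumer) thread
        }
    }

    public static void main(String[] args) {
        Set<String> nodesWithPendingFetchRequests = ConcurrentHashMap.newKeySet();
        String nodeId = "node-1";

        ToyRequestFuture future = new ToyRequestFuture();   // C @ T=0: fetch "sent", future created
        future.complete("disconnected");                    // H @ T=1: completed before the listener exists

        future.addListener(response ->                      // C @ T=2: listener fires immediately and
            nodesWithPendingFetchRequests.remove(nodeId));  // removes an entry that was never added

        nodesWithPendingFetchRequests.add(nodeId);          // C @ T=3: entry added, never removed again

        // The node now looks permanently "pending", so no further fetches would be sent to it.
        System.out.println("stuck pending nodes: " + nodesWithPendingFetchRequests); // prints [node-1]
    }
}
{code}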



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-14 Thread Will James (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951009#comment-16951009
 ] 

Will James commented on KAFKA-8950:
---

[~thomaslee] I guess I'm saying the listener is indeed called, but it is called 
before 
[https://github.com/apache/kafka/blob/6fb8bd0987d2da1ded55383a02692a34f120f3b4/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L336]
 where we add the node to the list of nodes with pending requests. In that 
case, the finally block is invoked before the node has been added, which leaves 
us in an invalid state. And the lock doesn't help, because the future is 
completed on the same thread and these locks are reentrant.

Actually, I was able to reproduce this in a test, but only by adding a 
Thread.sleep between client.send and .addListener in the Fetcher and 
continually calling .disconnectAsync and .pollNoWakeup from another thread...

Thanks [~rsivaram]. I believe we are using plaintext transport. I will look 
into getting you a partial thread dump.



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-14 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950883#comment-16950883
 ] 

Rajini Sivaram commented on KAFKA-8950:
---

[~wtjames] [~thomaslee] Thanks for reporting this. It is clearly a very 
critical issue that we should try to fix before the upcoming 2.4.0 release.

Are both of you using SSL (or SASL_SSL), i.e. an SSL transport layer?

Would either of you be willing to share a heap dump of the consumer in this 
state? If sharing the heap dump is not possible, could you share all the 
information from it related to the channel? All the fields (the actual contents 
of the field objects) of KafkaChannel, and of SslTransportLayer if SSL is in 
use, would be useful. Selector would also be useful.



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-13 Thread Tom Lee (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950707#comment-16950707
 ] 

Tom Lee commented on KAFKA-8950:


Ah wait, I misunderstood what you said above: so not trouble with the listener 
execution itself, but with the updates to the nodesWithPendingFetchRequests map. 
I don't think that's a problem, because Fetcher.send() is synchronized and both 
onSuccess/onFailure are protected by synchronized blocks. Even if the future 
fires first, send() still holds the lock.
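Roughly the locking shape I mean, as a hedged sketch (all names here are 
illustrative stand-ins, not the real client code):

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: the send path and both listener callbacks take the same monitor,
// so the pending-node updates themselves are mutually exclusive.
public class FetcherLockingSketch {

    interface Listener { void onComplete(); }

    static class FakeFuture {
        private Listener listener;
        void addListener(Listener l) { this.listener = l; }
        void complete() { if (listener != null) listener.onComplete(); }
    }

    private final Set<Integer> nodesWithPendingFetchRequests = ConcurrentHashMap.newKeySet();

    // analogous to the synchronized send path in the Fetcher
    public synchronized void sendFetch(int nodeId, FakeFuture future) {
        future.addListener(() -> {
            synchronized (FetcherLockingSketch.this) {   // same monitor as sendFetch()
                nodesWithPendingFetchRequests.remove(nodeId);
            }
        });
        nodesWithPendingFetchRequests.add(nodeId);       // still under the sendFetch() lock
    }
}
{code}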



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-13 Thread Tom Lee (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950702#comment-16950702
 ] 

Tom Lee commented on KAFKA-8950:


Hm, not sure I see what you mean. I think you're right that checkDisconnects 
could be called by the coordinator/heartbeat thread at the same time that send 
is invoked by the fetcher, so no disagreement there. But 
RequestFuture.addListener() enqueues the listener in a ConcurrentLinkedQueue and 
then checks an atomic reference to see whether the future has already succeeded 
or failed before invoking fire{Success,Failure}, which in turn invokes the 
enqueued listener.
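Spelled out, the pattern looks roughly like this (a simplified sketch of what I 
mean, not a copy of the actual RequestFuture source):

{code:java}
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Sketch of the addListener/complete ordering: INCOMPLETE is a sentinel that a
// real result replaces exactly once.
public class FutureSketch<T> {
    private static final Object INCOMPLETE = new Object();

    private final AtomicReference<Object> result = new AtomicReference<>(INCOMPLETE);
    private final ConcurrentLinkedQueue<Consumer<T>> listeners = new ConcurrentLinkedQueue<>();

    public void addListener(Consumer<T> listener) {
        listeners.add(listener);              // (1) enqueue first
        if (result.get() != INCOMPLETE)       // (2) then read the atomic reference
            fireAll();                        //     already completed: fire on the caller's thread
    }

    public void complete(T value) {
        result.set(value);                    // (A) write the atomic reference
        fireAll();                            // (B) then drain the queue
    }

    @SuppressWarnings("unchecked")
    private void fireAll() {
        Consumer<T> listener;
        while ((listener = listeners.poll()) != null)   // poll() means each listener fires at most once
            listener.accept((T) result.get());
    }
}
{code}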

So say we enqueue the listener and then "see" that the future has neither 
succeeded nor failed. It is then guaranteed that the heartbeat/coordinator 
thread will invoke the listener, because the enqueue "happened-before" the 
atomic reference write. On the other hand, if the atomic reference write 
"happened-before" our check of whether the future succeeded or failed, 
addListener will execute fireSuccess/fireFailure itself, which invokes the 
listener logic.

It's certainly delicate, but I don't think it's incorrect.

I personally think the issue is that we're somehow missing the 
complete()/raise() call entirely somewhere, but I'd be happy to be wrong if it 
gets us to a fix. :)




[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-13 Thread Will James (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950688#comment-16950688
 ] 

Will James commented on KAFKA-8950:
---

Yep that sounds exactly like our issue. I got about as far as you in reading 
the code and speculating.

Re: failing to invoke that listener, there is one case where I can see it 
happening: if the future immediately fails (or succeeds, though that is less 
likely), the callback is invoked as soon as we add the listener, _before_ we add 
the pending node into the set of nodes with pending requests, which means the 
node is never removed.

This may be possible if another thread (the heartbeat thread?) is polling and 
has called ConsumerNetworkClient#checkDisconnects while the fetcher is calling 
ConsumerNetworkClient#send. I'm not sure whether this scenario can actually 
happen, but it's easy to see that the fetcher's listener attachment is not 
thread safe. If another thread can succeed or fail the future before the Fetcher 
attaches its listener, the Fetcher ends up in a broken state.
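To make the ordering concrete, here is a small self-contained sketch of the race 
I'm describing (FakeFuture and friends are illustrative stand-ins, not the 
actual Fetcher/RequestFuture code):

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PendingFetchRaceSketch {

    interface Listener { void onComplete(); }

    static class FakeFuture {
        private boolean completed;
        private Listener listener;

        synchronized void addListener(Listener l) {
            listener = l;
            if (completed)
                l.onComplete();   // already failed/succeeded: fires immediately, on the caller's thread
        }

        synchronized void complete() {
            completed = true;
            if (listener != null)
                listener.onComplete();
        }
    }

    public static void main(String[] args) {
        Set<Integer> nodesWithPendingFetchRequests = ConcurrentHashMap.newKeySet();
        int nodeId = 8;

        FakeFuture sendFuture = new FakeFuture();
        sendFuture.complete();   // e.g. the request was already failed by a disconnect check

        // Fetcher-style ordering: attach the listener first...
        sendFuture.addListener(() -> nodesWithPendingFetchRequests.remove(nodeId));
        // ...then mark the node as pending. The remove() above already ran, so this
        // entry is never cleared and no new fetch will ever target this node.
        nodesWithPendingFetchRequests.add(nodeId);

        System.out.println("stuck pending nodes: " + nodesWithPendingFetchRequests);  // [8]
    }
}
{code}

Run as-is it always leaves the stale entry, because the future is completed 
before addListener is called; in the real client the same ordering would 
presumably only occur when something (checkDisconnects?) fails the request 
before the Fetcher records the node as pending.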



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-10-13 Thread Tom Lee (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950644#comment-16950644
 ] 

Tom Lee commented on KAFKA-8950:


I believe we're also running into this. High-throughput topics/partitions seem 
to be impacted more frequently than others. Like [~wtjames], we saw 
_nodesWithPendingFetchRequests_ indicate that a request was in flight, and we 
saw poll() regularly return an empty record set at our configured timeout 
interval. Digging deeper via a heap dump, we found no requests actually in 
flight according to ConsumerNetworkClient/NetworkClient/Selector, nor any 
indication of pending/completed/failed/aborted/unsent requests at the time of 
the dump; inFlightRequestsCount was zero. The state of the fetcher's 
nodesWithPendingFetchRequests map simply disagrees with the underlying 
ConsumerNetworkClient/NetworkClient/Selector: the Fetcher believes a fetch 
response (or timeout) will eventually come, and it never does.
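For what it's worth, the reason a stale entry wedges things, per my reading of 
the 2.3.0 Fetcher (sketch only, illustrative names, not the exact source), is 
that fetch preparation skips any node that still appears to have a fetch in 
flight:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PendingSkipSketch {

    // partition name -> leader node id (stand-in for real metadata)
    static List<String> fetchablePartitions(Map<String, Integer> partitionLeaders,
                                            Set<Integer> nodesWithPendingFetchRequests) {
        List<String> targets = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : partitionLeaders.entrySet()) {
            // "previous request to this node has not been processed" -- skip it,
            // and skip it forever if the pending entry is stale
            if (nodesWithPendingFetchRequests.contains(entry.getValue()))
                continue;
            targets.add(entry.getKey());
        }
        return targets;
    }

    public static void main(String[] args) {
        Map<String, Integer> leaders = Map.of("mytopic-0", 8);
        Set<Integer> stalePending = Set.of(8);   // the entry nobody ever removes
        System.out.println(fetchablePartitions(leaders, stalePending));  // [] -> no fetch ever sent
    }
}
{code}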

Perhaps worst of all, the heartbeat/coordinator thread continued sending 
heartbeats as normal, so no rebalance occurred and the "stuck" partition was 
never released by the consumer. The state of the leader's broker connection was 
still READY, with no broker connect/reconnect activity as far as I can tell from 
metrics, and no errors in the logs or exceptions thrown as far as I could see.

Given how simple the synchronization around the nodesWithPendingFetchRequests 
map is, it's as though we're somehow failing to invoke [this request 
listener|https://github.com/apache/kafka/blob/2.3.0/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L248-L249]
 or something like that, but the underlying networking code is tricky enough 
that it's difficult for me to speculate where. I've been eyeballing the 
_UnsentRequests_ data structure, since it seems to be one of the few places 
where state is manipulated outside of coarse synchronized blocks in both the 
coordinator and consumer threads, and there have been bugs reported against it 
in the past, but so far I've come up empty.

I can imagine that an exception thrown at the wrong time/place could cause the 
RequestFuture listener not to be invoked, but I have no evidence of that 
actually happening.



[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-09-27 Thread Will James (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939538#comment-16939538
 ] 

Will James commented on KAFKA-8950:
---

Here are the relevant logs from before one occurrence of the issue.

{{Sep 24 04:23:29 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info 
o.a.k.c.consumer.KafkaConsumer [Consumer clientId=[redacted]_[redacted], 
groupId=[redacted]] Subscribed to topic(s): [redacted] #012 }}
{{Sep 24 04:23:29 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info 
org.apache.kafka.clients.Metadata [Consumer clientId=[redacted]_[redacted], 
groupId=[redacted]] Cluster ID: 5hzjksKUQJSFOa84oTBWdw #012 }}
{{Sep 24 04:23:29 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info 
o.a.k.c.c.i.AbstractCoordinator [Consumer clientId=[redacted]_[redacted], 
groupId=[redacted]] Discovered group coordinator 
kafka-6-us-west-2.instana.io:9092 (id: 2147483646 rack: null) #012 }}
{{Sep 24 04:23:29 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info 
o.a.k.c.c.i.ConsumerCoordinator [Consumer clientId=[redacted]_[redacted], 
groupId=[redacted]] Revoking previously assigned partitions [] #012 }}
{{Sep 24 04:23:29 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info 
o.a.k.c.c.i.AbstractCoordinator [Consumer clientId=[redacted]_[redacted], 
groupId=[redacted]] (Re-)joining group #012 }}
{{Sep 24 04:23:32 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info 
o.a.k.c.c.i.AbstractCoordinator [Consumer clientId=[redacted]_[redacted], 
groupId=[redacted]] Successfully joined group with generation 2890 #012 }}
{{Sep 24 04:23:32 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info 
o.a.k.c.c.i.ConsumerCoordinator [Consumer clientId=[redacted]_[redacted], 
groupId=[redacted]] Setting newly assigned partitions: [redacted]-0 #012 }}
{{Sep 24 04:23:32 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info 
o.a.k.c.c.i.ConsumerCoordinator [Consumer clientId=[redacted]_[redacted], 
groupId=[redacted]] Setting offset for partition [redacted]-0 to the committed 
offset FetchPosition\{offset=8999851092, offsetEpoch=Optional.empty, 
currentLeader=LeaderAndEpoch{leader=kafka-13-us-west-2.instana.io:9092 (id: 8 
rack: null), epoch=-1}} #012 }}
{{Sep 24 04:23:33 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info 
o.a.k.c.c.i.SubscriptionState [Consumer clientId=[redacted]_[redacted], 
groupId=[redacted]] Seeking to LATEST offset of partition [redacted]-0 #012 }}
{{Sep 24 04:23:33 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info 
o.a.k.c.c.i.SubscriptionState [Consumer clientId=[redacted]_[redacted], 
groupId=[redacted]] Resetting offset for partition [redacted]-0 to offset 
8999872779. #012 }}

The consumer stopped consuming around 07:00 on the same day (although in some 
cases the error occurs days after the consumer starts). I don't see anything 
particularly interesting in here; basically it is just seeking to the end of 
the log.


[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching

2019-09-26 Thread Dhruvil Shah (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939062#comment-16939062
 ] 

Dhruvil Shah commented on KAFKA-8950:
-

Thanks for reporting the issue. If you have consumer logs from around the time 
it stopped making progress, that may be helpful.
