[jira] [Updated] (KAFKA-13126) Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads to missing rebalances

2021-07-22 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13126:
---
Description: 
In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was
overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this
override, users of both the plain consumer client and Kafka Streams still set
the poll interval to {{Integer.MAX_VALUE}} fairly often. Unfortunately, this
causes an integer overflow when computing the {{joinGroupTimeoutMs}}, which
results in it being set to the {{request.timeout.ms}} instead, which is much
lower.

This can easily make consumers drop out of the group, since they must now
rejoin within 30s (by default) but are almost never obligated to call poll()
given the high {{max.poll.interval.ms}} – basically they will only do so after
processing the last record from the previously polled batch. So in heavy
processing cases, where each record takes a long time to process, or when using
a very large {{max.poll.records}}, it can be difficult to make any progress at
all before dropping out and needing to rejoin. And of course, the rebalance
that is kicked off when this member rejoins can cause many of the other
members in the group to drop out as well, leading to an endless cycle of
missed rebalances.

We just need to check for overflow and clamp the value to {{Integer.MAX_VALUE}}
when it occurs. The workaround until then is to set
{{max.poll.interval.ms}} to {{Integer.MAX_VALUE}} - 5000 (5s is the
JOIN_GROUP_TIMEOUT_LAPSE).
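
To illustrate the overflow, here is a minimal sketch. It assumes the join-group timeout is computed as the larger of the request timeout and the poll interval plus the 5s lapse; that matches the behaviour described in this ticket but is not copied from the Kafka source, and the class and method names are hypothetical.

```java
// Hypothetical sketch, not the actual Kafka coordinator code.
final class JoinGroupTimeoutSketch {
    // The 5s lapse mentioned in the ticket (JOIN_GROUP_TIMEOUT_LAPSE).
    static final int JOIN_GROUP_TIMEOUT_LAPSE_MS = 5000;

    // Assumed shape of the buggy computation: the int addition wraps around
    // to a negative number when maxPollIntervalMs is close to Integer.MAX_VALUE,
    // so Math.max falls back to the request timeout.
    static int naiveTimeoutMs(int requestTimeoutMs, int maxPollIntervalMs) {
        return Math.max(requestTimeoutMs, maxPollIntervalMs + JOIN_GROUP_TIMEOUT_LAPSE_MS);
    }

    // Overflow-safe variant: add in long arithmetic, then clamp to Integer.MAX_VALUE.
    static int clampedTimeoutMs(int requestTimeoutMs, int maxPollIntervalMs) {
        long withLapse = (long) maxPollIntervalMs + JOIN_GROUP_TIMEOUT_LAPSE_MS;
        int clamped = (int) Math.min(withLapse, Integer.MAX_VALUE);
        return Math.max(requestTimeoutMs, clamped);
    }

    public static void main(String[] args) {
        int requestTimeoutMs = 30_000; // the 30s default mentioned above
        System.out.println(naiveTimeoutMs(requestTimeoutMs, Integer.MAX_VALUE));
        System.out.println(clampedTimeoutMs(requestTimeoutMs, Integer.MAX_VALUE));
        System.out.println(naiveTimeoutMs(requestTimeoutMs, Integer.MAX_VALUE - 5000));
    }
}
```

Under these assumptions the naive version falls back to the 30s request timeout for {{Integer.MAX_VALUE}}, while both the clamped version and the MAX_VALUE - 5000 workaround yield {{Integer.MAX_VALUE}}.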

  was:
In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was 
overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this 
override, users of both the plain consumer client and kafka streams still set 
the poll interval to MAX_VALUE somewhat often. Unfortunately, this causes an 
overflow when computing the {{joinGroupTimeoutMs}} and results in it being set 
to the {{request.timeout.ms}} instead, which is much lower.

This can easily make consumers drop out of the group, since they must rejoin 
now within 30s (by default) but have no obligation to almost ever call poll() 
given the high {{max.poll.interval.ms}} – basically they will only do so after 
processing the last record from the previously polled batch. So in heavy 
processing cases, where each record takes a long time to process, or when using 
a very large  {{max.poll.records}}, it can be difficult to make any progress at 
all before dropping out and needing to rejoin. And of course, the rebalance 
that is kicked off upon this member rejoining can result in many of the other 
members in the group dropping out as well, leading to an endless cycle of 
missed rebalances.

We just need to check for overflow and fix it to {{Integer.MAX_VALUE}} when it 
occurs.


> Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads 
> to missing rebalances
> -
>
> Key: KAFKA-13126
> URL: https://issues.apache.org/jira/browse/KAFKA-13126
> Project: Kafka
> Issue Type: Bug
> Components: consumer
> Reporter: A. Sophie Blee-Goldman
> Assignee: A. Sophie Blee-Goldman
> Priority: Major
> Fix For: 3.1.0
>
>
> In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was
> overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this
> override, users of both the plain consumer client and Kafka Streams still set
> the poll interval to {{Integer.MAX_VALUE}} fairly often. Unfortunately, this
> causes an integer overflow when computing the {{joinGroupTimeoutMs}}, which
> results in it being set to the {{request.timeout.ms}} instead, which is much
> lower.
>
> This can easily make consumers drop out of the group, since they must now
> rejoin within 30s (by default) but are almost never obligated to call poll()
> given the high {{max.poll.interval.ms}} – basically they will only do so
> after processing the last record from the previously polled batch. So in
> heavy processing cases, where each record takes a long time to process, or
> when using a very large {{max.poll.records}}, it can be difficult to make
> any progress at all before dropping out and needing to rejoin. And of course,
> the rebalance that is kicked off when this member rejoins can cause many of
> the other members in the group to drop out as well, leading to an endless
> cycle of missed rebalances.
>
> We just need to check for overflow and clamp the value to
> {{Integer.MAX_VALUE}} when it occurs. The workaround until then is to set
> {{max.poll.interval.ms}} to {{Integer.MAX_VALUE}} - 5000 (5s is the
> JOIN_GROUP_TIMEOUT_LAPSE).
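
As a rough sketch of the workaround, the consumer overrides could be built as below. The helper class and method names are hypothetical; only the {{max.poll.interval.ms}} key and the 5000 ms lapse come from the ticket.

```java
import java.util.Properties;

// Hypothetical helper illustrating the workaround; not part of Kafka.
final class PollIntervalWorkaround {
    // The 5s lapse mentioned in the ticket (JOIN_GROUP_TIMEOUT_LAPSE).
    static final int JOIN_GROUP_TIMEOUT_LAPSE_MS = 5000;

    // Leaves room for the lapse so that adding it back cannot overflow an int.
    static Properties consumerOverrides() {
        Properties props = new Properties();
        props.setProperty("max.poll.interval.ms",
                String.valueOf(Integer.MAX_VALUE - JOIN_GROUP_TIMEOUT_LAPSE_MS));
        return props;
    }

    public static void main(String[] args) {
        System.out.println(consumerOverrides().getProperty("max.poll.interval.ms"));
    }
}
```

These properties would then be merged into whatever configuration the consumer or Streams application already passes to the client.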



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13126) Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads to missing rebalances

2021-07-22 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13126:
---
Description: 
In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was
overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this
override, users of both the plain consumer client and Kafka Streams still set
the poll interval to {{Integer.MAX_VALUE}} fairly often. Unfortunately, this
causes an integer overflow when computing the {{joinGroupTimeoutMs}}, which
results in it being set to the {{request.timeout.ms}} instead, which is much
lower.

This can easily make consumers drop out of the group, since they must now
rejoin within 30s (by default) but are almost never obligated to call poll()
given the high {{max.poll.interval.ms}} – basically they will only do so after
processing the last record from the previously polled batch. So in heavy
processing cases, where each record takes a long time to process, or when using
a very large {{max.poll.records}}, it can be difficult to make any progress at
all before dropping out and needing to rejoin. And of course, the rebalance
that is kicked off when this member rejoins can cause many of the other
members in the group to drop out as well, leading to an endless cycle of
missed rebalances.

We just need to check for overflow and clamp the value to {{Integer.MAX_VALUE}}
when it occurs.

  was:
In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was 
overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this 
override, users of both the plain consumer client and kafka streams still set 
the poll interval to MAX_VALUE somewhat often. Unfortunately, this causes an 
overflow when computing the {{joinGroupTimeoutMs}} and results in it being set 
to the {{request.timeout.ms}} instead, which is much lower.

This can easily make consumers drop out of the group, since they must rejoin 
now within 30s (by default) yet have no obligation to almost ever call poll() 
given the high {{max.poll.interval.ms}}. We just need to check for overflow and 
fix it to {{Integer.MAX_VALUE}} when it occurs.


> Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads 
> to missing rebalances
> -
>
> Key: KAFKA-13126
> URL: https://issues.apache.org/jira/browse/KAFKA-13126
> Project: Kafka
> Issue Type: Bug
> Components: consumer
> Reporter: A. Sophie Blee-Goldman
> Assignee: A. Sophie Blee-Goldman
> Priority: Major
> Fix For: 3.1.0
>
>
> In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was
> overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this
> override, users of both the plain consumer client and Kafka Streams still set
> the poll interval to {{Integer.MAX_VALUE}} fairly often. Unfortunately, this
> causes an integer overflow when computing the {{joinGroupTimeoutMs}}, which
> results in it being set to the {{request.timeout.ms}} instead, which is much
> lower.
>
> This can easily make consumers drop out of the group, since they must now
> rejoin within 30s (by default) but are almost never obligated to call poll()
> given the high {{max.poll.interval.ms}} – basically they will only do so
> after processing the last record from the previously polled batch. So in
> heavy processing cases, where each record takes a long time to process, or
> when using a very large {{max.poll.records}}, it can be difficult to make
> any progress at all before dropping out and needing to rejoin. And of course,
> the rebalance that is kicked off when this member rejoins can cause many of
> the other members in the group to drop out as well, leading to an endless
> cycle of missed rebalances.
>
> We just need to check for overflow and clamp the value to
> {{Integer.MAX_VALUE}} when it occurs.


