Re: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled
Just want to share another variant of the log message, which is also related to metadata and rebalancing but has a different client reason:

INFO [GroupCoordinator 3]: Preparing to rebalance group in state PreparingRebalance with old generation nnn (__consumer_offsets-nn) (reason: Updating metadata for member during Stable; client reason: triggered followup rebalance scheduled for 0) (kafka.coordinator.group.GroupCoordinator)

Thank you.

Kind regards,
Venkatesh

From: Venkatesh Nagarajan
Date: Wednesday, 13 March 2024 at 12:06 pm
To: users@kafka.apache.org
Subject: Re: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled

Thanks very much for your important inputs, Matthias. I will use the default METADATA_MAX_AGE_CONFIG. I had set it to 5 hours when I saw a lot of such rebalancing-related messages in the MSK broker logs:

INFO [GroupCoordinator 2]: Preparing to rebalance group in state PreparingRebalance with old generation (__consumer_offsets-nn) (reason: Updating metadata for member during Stable; client reason: need to revoke partitions and re-join) (kafka.coordinator.group.GroupCoordinator)

I am guessing that the two are unrelated. If you have any suggestions on how to reduce such rebalancing, that would be very helpful.

Thank you very much.

Kind regards,
Venkatesh

From: Matthias J. Sax
Date: Tuesday, 12 March 2024 at 1:31 pm
To: users@kafka.apache.org
Subject: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled

Without detailed logs (maybe even DEBUG) it is hard to say. But from what you describe, it could be a metadata issue? Why are you setting

> METADATA_MAX_AGE_CONFIG (consumer and producer): 5 hours in millis (to make rebalances rare)

Refreshing metadata has nothing to do with rebalances, and a metadata refresh does not trigger a rebalance.

-Matthias

On 3/10/24 5:56 PM, Venkatesh Nagarajan wrote:
> Hi all,
>
> A Kafka Streams application sometimes stops consuming events during load testing.
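[Editor's note: as context for the configuration discussed in this thread, here is a minimal, dependency-free sketch of the client properties mentioned in the emails. The string keys mirror the Kafka constant names (GROUP_INSTANCE_ID_CONFIG, MAX_POLL_INTERVAL_MS_CONFIG, etc.); the application id and instance id values are hypothetical. Per Matthias's advice, metadata.max.age.ms is left at its default rather than set to 5 hours.]

```java
import java.util.Properties;

// Sketch of the Streams/consumer configuration described in this thread.
// Key strings mirror the Kafka config constants; values are from the emails.
public class StreamsConfigSketch {
    public static Properties build(String instanceId) {
        Properties props = new Properties();
        props.put("application.id", "enrichment-app");     // hypothetical app id
        props.put("group.instance.id", instanceId);        // static membership
        // metadata.max.age.ms deliberately left at its default (5 minutes);
        // the 5-hour override discussed above does not affect rebalancing.
        props.put("max.poll.interval.ms", 20 * 60 * 1000); // 20 minutes
        props.put("session.timeout.ms", 2 * 60 * 1000);    // 2 minutes
        props.put("max.poll.records", 10);
        props.put("topology.optimization", "all");         // OPTIMIZE
        props.put("statestore.cache.max.bytes", 300L * 1024 * 1024);
        props.put("num.standby.replicas", 1);
        return props;
    }
}
```

In a real application these properties would be passed to the KafkaStreams constructor; here they are plain java.util.Properties so the sketch stands alone.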
> Please find below the details:
>
> Details of the app:
>
> * Kafka Streams version: 3.5.1
> * Kafka: AWS MSK v3.6.0
> * Consumes events from 6 topics
> * Calls APIs to enrich events
> * Sometimes joins two streams
> * Produces enriched events to output topics
>
> Runs on AWS ECS:
>
> * Each task has 10 streaming threads
> * Autoscaling based on offset lags, with a maximum of 6 ECS tasks
> * Input topics have 60 partitions each, to match 6 tasks * 10 threads
> * Fairly good spread of events across all topic partitions using partitioning keys
>
> Settings and configuration:
>
> * At-least-once semantics
> * MAX_POLL_RECORDS_CONFIG: 10
> * APPLICATION_ID_CONFIG
>
> // Make rebalances rare and prevent stop-the-world rebalances
>
> * Static membership (using GROUP_INSTANCE_ID_CONFIG)
> * METADATA_MAX_AGE_CONFIG (consumer and producer): 5 hours in millis (to make rebalances rare)
> * MAX_POLL_INTERVAL_MS_CONFIG: 20 minutes in millis
> * SESSION_TIMEOUT_MS_CONFIG: 2 minutes in millis
>
> State store related settings:
>
> * TOPOLOGY_OPTIMIZATION_CONFIG: OPTIMIZE
> * STATESTORE_CACHE_MAX_BYTES_CONFIG: 300 * 1024 * 1024L
> * NUM_STANDBY_REPLICAS_CONFIG: 1
>
> Symptoms:
> The symptoms mentioned below occur during load tests.
>
> Scenario 1: Steady input event stream
>
> Observations:
>
> * Gradually increasing offset lags, which shouldn't happen normally as the streaming app is quite fast
> * Events get processed
>
> Scenario 2: No input events after the load test stops producing events
>
> Observations:
>
> * Offset lag stuck at ~5k
> * Stable consumer group
> * No events processed
> * No errors or messages in the logs
>
> Scenario 3: Restart the app when it stops processing events although offset lags are not zero
>
> Observations:
>
> * Offset lags start reducing and events start getting processed
>
> Scenario 4: Transient errors occur while processing events
>
> * A custom exception handler that implements StreamsUncaughtExceptionHandler returns
> StreamThreadExceptionResponse.REPLACE_THREAD in the handle method
> * If transient errors keep occurring occasionally and threads get replaced, the problem of the app stalling disappears
> * But if transient errors don't occur, the app tends to stall and I need to manually restart it
>
> Summary:
>
> * It appears that some streaming threads stall after processing for a while.
> * It is difficult to change the log level for Kafka Streams from ERROR to INFO, as it starts producing a lot of log messages, especially during load tests.
> * I haven't yet managed to push Kafka Streams metrics into the AWS OTEL collector to get more insights.
>
> Can you please let me know if any Kafka Streams config settings need changing? Should I reduce the values of any of these settings to help trigger rebalancing early and hence assign partitions to members that are active:
>
> * METADATA_MAX_AGE_CONFIG: 5 hours in millis (to make rebalances rare)
> * MAX_POLL_INTERVAL_MS_CONFIG: 20 minutes in millis
> * SESSION_TIMEOUT_MS_CONFIG: 2 minutes in millis
>
> Should I get rid of static membership? This may increase rebalancing, but may be okay if it can prevent stalled threads from appearing as active members.
>
> Should I try upgrading Kafka Streams to v3.6.0 or v3.7.0? Hoping that v3.7.0 will be compatible with AWS MSK v3.6.0.
>
> Thank you very much.
>
> Kind regards,
> Venkatesh
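[Editor's note: the REPLACE_THREAD behaviour described in Scenario 4 can be sketched as below. This is a dependency-free stand-in: in the real application the class would implement org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler and return StreamThreadExceptionResponse.REPLACE_THREAD; the nested Response enum here merely mirrors that enum's values, and the transient-vs-fatal classification is a hypothetical example, not the poster's actual logic.]

```java
// Dependency-free sketch of a REPLACE_THREAD uncaught-exception handler.
// Mirrors org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler,
// whose handle(Throwable) method returns a StreamThreadExceptionResponse.
public class ReplaceThreadHandlerSketch {
    enum Response { REPLACE_THREAD, SHUTDOWN_CLIENT, SHUTDOWN_APPLICATION }

    public static Response handle(Throwable exception) {
        if (exception instanceof Error) {
            // JVM-level errors (e.g. OutOfMemoryError) are not safe to retry.
            return Response.SHUTDOWN_APPLICATION;
        }
        // Treat other exceptions as transient: replace the stream thread so
        // processing resumes. With static membership, the replacement thread
        // can rejoin within the session timeout without forcing a rebalance.
        return Response.REPLACE_THREAD;
    }
}
```

The handler would be registered via KafkaStreams#setUncaughtExceptionHandler before the application is started.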