Re: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled

2024-03-12 Thread Venkatesh Nagarajan
Just want to share another variant of the log message which is also related to 
metadata and rebalancing but has a different client reason:

INFO [GroupCoordinator 3]: Preparing to rebalance group  in state 
PreparingRebalance with old generation nnn (__consumer_offsets-nn) (reason: 
Updating metadata for member  during Stable; client reason: triggered 
followup rebalance scheduled for 0) (kafka.coordinator.group.GroupCoordinator)
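
To see which client reasons dominate, the broker log lines can be tallied by their "client reason" field. The following is only a rough sketch: the class name, method name and regex are illustrative assumptions based on the two log lines quoted in this thread, not on any documented log format guarantee.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts GroupCoordinator rebalance log lines per client reason.
final class RebalanceReasonTally {
    // Captures the broker-side reason (group 1) and the client reason (group 2).
    // DOTALL lets the pattern match a message that was wrapped across lines.
    private static final Pattern REASON = Pattern.compile(
        "\\(reason:\\s*(.*?);\\s*client reason:\\s*(.*?)\\)\\s*\\(kafka\\.coordinator",
        Pattern.DOTALL);

    static Map<String, Integer> tally(List<String> logMessages) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String message : logMessages) {
            Matcher m = REASON.matcher(message);
            if (m.find()) {
                // Collapse wrapped whitespace so identical reasons compare equal.
                String clientReason = m.group(2).replaceAll("\\s+", " ").trim();
                counts.merge(clientReason, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "INFO [GroupCoordinator 3]: Preparing to rebalance group  in state "
                + "PreparingRebalance with old generation nnn (__consumer_offsets-nn) (reason: "
                + "Updating metadata for member  during Stable; client reason: triggered "
                + "followup rebalance scheduled for 0) (kafka.coordinator.group.GroupCoordinator)",
            "INFO [GroupCoordinator 2]: Preparing to rebalance group  in state "
                + "PreparingRebalance with old generation  (__consumer_offsets-nn) (reason: "
                + "Updating metadata for member  during Stable; client reason: need to "
                + "revoke partitions and re-join) (kafka.coordinator.group.GroupCoordinator)");
        tally(sample).forEach((reason, n) -> System.out.println(n + " x " + reason));
    }
}
```

Lines that do not match the pattern are simply skipped, so the whole broker log can be passed in without filtering first.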

Thank you.

Kind regards,
Venkatesh


Re: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled

2024-03-12 Thread Venkatesh Nagarajan
Thanks very much for your important inputs, Matthias.

I will use the default METADATA_MAX_AGE_CONFIG. I set it to 5 hours when I saw 
a lot of such rebalancing related messages in the MSK broker logs:

INFO [GroupCoordinator 2]: Preparing to rebalance group  in state 
PreparingRebalance with old generation  (__consumer_offsets-nn) (reason: 
Updating metadata for member  during Stable; client reason: need to 
revoke partitions and re-join) (kafka.coordinator.group.GroupCoordinator)

I am guessing that the two are unrelated. If you have any suggestions on how to 
reduce such rebalancing, that will be very helpful.
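
For reference, the settings from my original post with metadata.max.age.ms left at its default (5 minutes) could be sketched roughly as below. This assumes the plain string keys behind the StreamsConfig and ConsumerConfig constants; the class name and the two parameters are hypothetical, and required settings such as bootstrap.servers are omitted.

```java
import java.time.Duration;
import java.util.Properties;

// Hypothetical sketch of the settings discussed in this thread.
final class StreamsSettingsSketch {
    static Properties build(String applicationId, String instanceId) {
        Properties p = new Properties();
        p.put("application.id", applicationId);
        p.put("processing.guarantee", "at_least_once");
        p.put("max.poll.records", 10);

        // Static membership; the instance id must be unique per ECS task.
        p.put("group.instance.id", instanceId);
        // These two are int-typed configs, so convert the millis explicitly.
        p.put("max.poll.interval.ms", Math.toIntExact(Duration.ofMinutes(20).toMillis()));
        p.put("session.timeout.ms", Math.toIntExact(Duration.ofMinutes(2).toMillis()));

        // metadata.max.age.ms is intentionally not set, so the client default applies.

        // State store related settings.
        p.put("topology.optimization", "all"); // value of StreamsConfig.OPTIMIZE
        p.put("statestore.cache.max.bytes", 300 * 1024 * 1024L);
        p.put("num.standby.replicas", 1);
        return p;
    }

    public static void main(String[] args) {
        build("example-app", "ecs-task-1").forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```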

Thank you very much.

Kind regards,
Venkatesh

From: Matthias J. Sax 
Date: Tuesday, 12 March 2024 at 1:31 pm
To: users@kafka.apache.org 
Subject: [EXTERNAL] Re: Kafka Streams 3.5.1 based app seems to get stalled
Without detailed logs (maybe even DEBUG), it is hard to say.

But from what you describe, it could be a metadata issue? Why are you
setting

> METADATA_MAX_AGE_CONFIG (consumer and producer): 5 hours in millis (to make 
> rebalances rare)

Refreshing metadata has nothing to do with rebalances, and a metadata
refresh does not trigger a rebalance.



-Matthias


On 3/10/24 5:56 PM, Venkatesh Nagarajan wrote:
> Hi all,
>
> A Kafka Streams application sometimes stops consuming events during load 
> testing. Please find below the details:
>
> Details of the app:
>
> *   Kafka Streams Version: 3.5.1
> *   Kafka: AWS MSK v3.6.0
> *   Consumes events from 6 topics
> *   Calls APIs to enrich events
> *   Sometimes joins two streams
> *   Produces enriched events in output topics
>
> Runs on AWS ECS:
>
> *   Each task has 10 streaming threads
> *   Autoscaling based on offset lags and a maximum of 6 ECS tasks
> *   Input topics have 60 partitions each to match 6 tasks * 10 threads
> *   Fairly good spread of events across all topic partitions using
>     partitioning keys
>
> Settings and configuration:
>
> *   At least once semantics
> *   MAX_POLL_RECORDS_CONFIG: 10
> *   APPLICATION_ID_CONFIG
>
> // Make rebalances rare and prevent stop-the-world rebalances
>
> *   Static membership (using GROUP_INSTANCE_ID_CONFIG)
> *   METADATA_MAX_AGE_CONFIG (consumer and producer): 5 hours in millis (to
>     make rebalances rare)
> *   MAX_POLL_INTERVAL_MS_CONFIG: 20 minutes in millis
> *   SESSION_TIMEOUT_MS_CONFIG: 2 minutes in millis
>
> State store related settings:
>
> *   TOPOLOGY_OPTIMIZATION_CONFIG: OPTIMIZE
> *   STATESTORE_CACHE_MAX_BYTES_CONFIG: 300 * 1024 * 1024L
> *   NUM_STANDBY_REPLICAS_CONFIG: 1
>
>
> Symptoms:
> The symptoms mentioned below occur during load tests:
>
> Scenario #1:
> Steady input event stream
>
> Observations:
>
> *   Gradually increasing offset lags which shouldn't happen normally as
>     the streaming app is quite fast
> *   Events get processed
>
> Scenario #2:
> No input events after the load test stops producing events
>
> Observations:
>
> *   Offset lag stuck at ~5k
> *   Stable consumer group
> *   No events processed
> *   No errors or messages in the logs
>
> Scenario #3:
> Restart the app when it stops processing events although offset lags are not
> zero
>
> Observations:
>
> *   Offset lags start reducing and events start getting processed
>
> Scenario #4:
> Transient errors occur while processing events
>
> *   A custom exception handler that implements
>     StreamsUncaughtExceptionHandler returns
>     StreamThreadExceptionResponse.REPLACE_THREAD in the handle method
> *   If transient errors keep occurring occasionally and threads get
>     replaced, the problem of the app stalling disappears.
> *   But if transient errors don't occur, the app tends to stall and I need
>     to manually restart it
>
> Summary:
>
> *   It appears that some streaming threads stall after processing for a
>     while.
> *   It is difficult to change the log level for Kafka Streams from ERROR to
>     INFO, as it starts producing a lot of log messages, especially during
>     load tests.
> *   I haven't yet managed to push Kafka Streams metrics into the AWS OTEL
>     collector to get more insights.
>
> Can you please let me know if any Kafka Streams config settings need 
> changing? Should I reduce the values of any of these settings to help trigger 
> rebalancing early and hence assign partitions to members that are active:
>
> *   METADATA_MAX_AGE_CONFIG: 5 hours in millis (to make rebalances rare)
> *   MAX_POLL_INTERVAL_MS_CONFIG: 20 minutes in millis
> *   SESSION_TIMEOUT_MS_CONFIG: 2 minutes in millis
>
> Should I get rid of static membership? This may increase rebalancing, but
> that may be okay if it prevents stalled threads from appearing as active
> members.
>
> Should I try upgrading Kafka Streams to v3.6.0 or v3.7.0? Hoping that v3.7.0 
> will be compatible with AWS MSK v3.6.0.
>
>
> Thank you very much.
>
> Kind regards,
> Venkatesh