The first thing I'd look at is your `max.poll.records` setting. The default for Streams is 1000 (see https://docs.confluent.io/current/streams/developer-guide/config-streams.html#default-values). Depending on your workload, this can definitely cause long rebalances: a consumer only rejoins the group on its next `poll()` call, so a large batch that is slow to process holds up the rebalance for the whole group. It did for me, but my workload requires quite long processing times.
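As a rough sketch of what lowering that setting could look like, the snippet below builds the properties for a Streams application with a smaller batch size. The application id, the broker address, and the value 100 are placeholders I made up for illustration, not recommendations; the right number depends on how long each record takes to process.

```java
import java.util.Properties;

public class StreamsPollConfig {

    // Builds Streams properties with a reduced max.poll.records.
    // Plain string keys are used so the sketch needs no Kafka dependency.
    static Properties buildConfig(int maxPollRecords) {
        Properties props = new Properties();
        // Hypothetical application id and broker address; replace with your own.
        props.setProperty("application.id", "latency-sensitive-app");
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Smaller batches mean less work between two poll() calls.
        props.setProperty("max.poll.records", Integer.toString(maxPollRecords));
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig(100); // 100 is an illustrative value only
        System.out.println("max.poll.records = " + props.getProperty("max.poll.records"));
    }
}
```

The resulting `Properties` object would then be passed to `new KafkaStreams(topology, props)` in the actual application.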
Regards,
Raman

On Wed, Jan 16, 2019 at 3:59 AM Javier Arias Losada <javier.ari...@gmail.com> wrote:

> Dear all,
>
> we are starting to work with Kafka Streams; our service is a very simple
> stateless consumer.
>
> We have tight requirements on latency, and we are facing too-high latency
> problems when the consumer group is rebalancing. In our scenario,
> rebalancing will happen relatively often: rolling updates of code, scaling
> up/down the service, containers being shuffled by the cluster scheduler,
> containers dying, hardware failing.
>
> One of the first tests we have done is having a small consumer group with
> 4 consumers handling a small amount of messages (1K/sec) and killing one of
> them; the cluster manager (currently AWS-ECS, probably soon moving to K8S)
> starts a new one. So, more than one rebalancing is done.
>
> Our most critical metric is latency, which we measure as the milliseconds
> between message creation and message consumption. We saw the maximum
> latency spiking from a few milliseconds to almost 15 seconds.
>
> [images: three attachments named image.png]
>
> We also have done tests with some rolling updates of code and the results
> are worse, since our deployment is not prepared for Kafka services and we
> trigger a lot of rebalancings. We'll need to work on that, but we are
> wondering what strategies other people follow for doing code deployment
> / autoscaling with the minimum possible delays.
>
> Not sure whether it helps, but our requirements are pretty relaxed regarding
> message processing: we don't care about some messages being processed twice
> from time to time, nor are we very strict about the ordering of messages.
>
> We are using all default configurations, no tuning.
>
> We need to improve these latency spikes during rebalancing.
> Can someone, please, give us some hints on how to work on it? Is touching
> configurations enough? Do we need to use some concrete partition assignor?
> Implement our own?
>
> What is the recommended approach to code deployment / autoscaling with the
> minimum possible delays?
>
> Our Kafka version is 1.1.0; after looking at libs found, for example
> kafka/kafka_2.11-1.1.0-cp1.jar, we installed Confluent Platform 4.1.0.
> On the consumer side, we are using Kafka-streams 2.1.0.
>
> Thank you for reading my question and for your responses.
> Best,
> Javier Arias Losada