Re: Frequent consumer rebalance, auto commit failures

amit pal Thu, 24 May 2018 20:48:23 -0700

Hi Shantanu,

If you are using Kafka Streams, upgrade to the latest jar. There are a bunch
of fixes in the way it uses Kafka consumers.


Apart from this, try these settings:
1. Set session.timeout.ms higher, to something like 300000.
2. Set heartbeat.interval.ms to a lower value, something like 2000.
3. Set max.poll.interval.ms to a value comfortably above the time you need
   to process one batch returned by poll().
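
As a rough sketch, this is how the three settings could be put together
before they are handed to the consumer. It uses only java.util.Properties, so
it compiles without the Kafka jar. The class name ConsumerSettingsSketch and
the 600000 ms poll interval are illustrative assumptions, and
max.poll.interval.ms is only understood by clients 0.10.1.0 or newer.

```java
import java.util.Properties;

public class ConsumerSettingsSketch {

    // Builds consumer properties from the config keys listed above.
    // The numeric values passed in are examples, not recommendations.
    public static Properties build(int sessionTimeoutMs,
                                   int heartbeatIntervalMs,
                                   int maxPollIntervalMs) {
        // The heartbeat interval has to be lower than the session timeout.
        if (heartbeatIntervalMs >= sessionTimeoutMs) {
            throw new IllegalArgumentException(
                "heartbeat.interval.ms must be lower than session.timeout.ms");
        }
        Properties props = new Properties();
        props.setProperty("session.timeout.ms", String.valueOf(sessionTimeoutMs));
        props.setProperty("heartbeat.interval.ms", String.valueOf(heartbeatIntervalMs));
        props.setProperty("max.poll.interval.ms", String.valueOf(maxPollIntervalMs));
        return props;
    }

    public static void main(String[] args) {
        Properties props = build(300000, 2000, 600000);
        props.list(System.out);
    }
}
```

The resulting Properties object would then be merged with your existing
settings (bootstrap.servers, group.id, deserializers) and passed to the
KafkaConsumer constructor. Keep in mind that the broker bounds the allowed
session timeout with group.min.session.timeout.ms and
group.max.session.timeout.ms, so a large value may need a broker-side change.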

If your processing takes time, you can reduce max.poll.records, all the way
down to 1 if needed.
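
To make the trade-off concrete, here is a back-of-the-envelope sketch. It
assumes records are processed one after another inside the poll loop, so the
worst-case gap between two poll() calls is roughly max.poll.records times the
per-record processing time. The class and method names are made up for
illustration; the 20 second figure is the one reported earlier in this thread.

```java
public class PollBudgetSketch {

    // Worst-case time between two poll() calls when every record of a
    // full batch takes perRecordMs to process sequentially.
    public static long worstCaseBatchMs(int maxPollRecords, long perRecordMs) {
        return (long) maxPollRecords * perRecordMs;
    }

    // Largest max.poll.records whose worst-case batch time still fits in
    // budgetMs (never less than 1, since the consumer needs at least one).
    public static int maxRecordsWithin(long budgetMs, long perRecordMs) {
        if (perRecordMs <= 0) {
            throw new IllegalArgumentException("perRecordMs must be positive");
        }
        return (int) Math.max(1, budgetMs / perRecordMs);
    }

    public static void main(String[] args) {
        long perRecordMs = 20_000;
        System.out.println(worstCaseBatchMs(5, perRecordMs));      // 100000
        System.out.println(worstCaseBatchMs(20, perRecordMs));     // 400000
        System.out.println(maxRecordsWithin(60_000, perRecordMs)); // 3
    }
}
```

Which timeout the budget stands for depends on the client version: with
0.10.0.x and older clients the gap between polls is bounded by
session.timeout.ms, while from 0.10.1.0 onward (KIP-62) it is bounded by
max.poll.interval.ms.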



On Thu, May 24, 2018 at 9:27 PM Shantanu Deshmukh <shantanu...@gmail.com>
wrote:

> Hey Vincent.
> That's exactly how my code is. I am doing processing within that for loop.
>
> In KIP-62 I read that heartbeats are sent from a separate thread
> https://github.com/dpkp/kafka-python/issues/948. But you are saying they
> are sent through polling. Which is true? I have set
> session.timeout.ms to 5 minutes. max.poll.records is set to 5. So even if
> each message takes 30 seconds to process, it still shouldn't cross this
> threshold. Yet I see frequent rebalances. Then there is max.poll.interval.ms
> too. I don't know exactly how it affects this. Overall I am finding it very
> difficult to understand this myriad of settings, and the documentation is
> not very clear.
>
> On Thu, May 24, 2018 at 8:09 PM Vincent Maurin <vincent.mau...@glispa.com>
> wrote:
>
> > Shantanu, I was referring more to your application code.
> > You should have something similar to:
> >
> > while (true) {
> >     // Waits up to 100 ms for new records
> >     ConsumerRecords<String, String> records = consumer.poll(100);
> >     for (ConsumerRecord<String, String> record : records) {
> >         // Your logic
> >     }
> > }
> >
> > You should make sure that the code within the loop doesn't take too much
> > time (more than session.timeout.ms).
> > From the consumer javadoc:
> > "The consumer will automatically ping the cluster periodically, which lets
> > the cluster know that it is alive. Note that the consumer is
> > single-threaded, so periodic heartbeats can only be sent when poll(long)
> > <https://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#poll(long)>
> > is called. As long as the consumer is able to do this it is considered
> > alive and retains the right to consume from the partitions assigned to it.
> > If it stops heartbeating by failing to call poll(long) for a period of time
> > longer than session.timeout.ms then it will be considered dead and its
> > partitions will be assigned to another process."
> >
> > Best
> >
> > On Thu, May 24, 2018 at 4:07 PM Shantanu Deshmukh <shantanu...@gmail.com>
> > wrote:
> >
> > > Another observation is that when I restart my application, consumption
> > > doesn't start for 5-6 minutes. In the Kafka consumer logs I see
> > >
> > > ConsumerCoordinator.333 - Revoking previously assigned partitions [] for
> > > group notifications-consumer
> > > AbstractCoordinator:381 - (Re-)joining group notifications-consumer
> > >
> > > Then nothing. After 5-6 minutes activity starts.
> > >
> > > On Thu, May 24, 2018 at 6:49 PM Shantanu Deshmukh <shantanu...@gmail.com>
> > > wrote:
> > >
> > > > Hi Vincent,
> > > >
> > > > Yes, I reduced max.poll.records to get that same effect. I reduced it all
> > > > the way down to 5 records, but I am still seeing the same error. What else
> > > > can be done? For one topic I can see that processing a single message
> > > > takes about 20 seconds, so 5 of them will take about 100 seconds. So I set
> > > > session.timeout.ms to 5 minutes and max.poll.interval.ms to 10 minutes,
> > > > but it is still not helping.
> > > >
> > > > On Thu, May 24, 2018 at 6:15 PM Vincent Maurin <vincent.mau...@glispa.com>
> > > > wrote:
> > > >
> > > >> Hello Shantanu,
> > > >>
> > > >> It is also important to consider your consumer code. You should not spend
> > > >> too much time between two calls to the "poll" method. Otherwise, the
> > > >> consumer not calling poll will be considered dead by the group, triggering
> > > >> a rebalance.
> > > >>
> > > >> Best
> > > >>
> > > > >> On Thu, May 24, 2018 at 1:45 PM M. Manna <manme...@gmail.com> wrote:
> > > > >>
> > > > >> > Set your rebalance.backoff.ms=10000 and
> > > > >> > zookeeper.session.timeout.ms=30000
> > > > >> > in addition to what Manikumar said.
> > > >> >
> > > >> >
> > > >> >
> > > > >> > On 24 May 2018 at 12:41, Shantanu Deshmukh <shantanu...@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Hello,
> > > > >> > >
> > > > >> > > There was a typo in my first mail. session.timeout.ms is actually
> > > > >> > > 60000, not 6000. So heartbeat.interval.ms is lower than it.
> > > >> > >
> > > > >> > > On Thu, May 24, 2018 at 2:46 PM Manikumar <manikumar.re...@gmail.com>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > heartbeat.interval.ms should be lower than session.timeout.ms.
> > > > >> > > >
> > > > >> > > > Check here:
> > > > >> > > > http://kafka.apache.org/0101/documentation.html#newconsumerconfigs
> > > >> > > >
> > > >> > > >
> > > > >> > > > On Thu, May 24, 2018 at 2:39 PM, Shantanu Deshmukh <shantanu...@gmail.com>
> > > > >> > > > wrote:
> > > > >> > > >
> > > > >> > > > > Someone please help me. I have been suffering from this issue for a
> > > > >> > > > > long time and have not found any solution.
> > > >> > > > >
> > > > >> > > > > On Wed, May 23, 2018 at 3:48 PM Shantanu Deshmukh <shantanu...@gmail.com>
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > > > We have a 3-broker Kafka 0.10.0.1 cluster with 3 topics of 10
> > > > >> > > > > > partitions each. We have an application which spawns threads as
> > > > >> > > > > > consumers. We spawn 5 consumers for each topic. I am observing that
> > > > >> > > > > > the consumer group randomly keeps rebalancing. Many times we see logs
> > > > >> > > > > > saying "Revoking partitions for". This happens almost every 10
> > > > >> > > > > > minutes. Consumption completely stops during this time.
> > > >> > > > > >
> > > >> > > > > > I have applied this configuration
> > > >> > > > > > max.poll.records 20
> > > >> > > > > > heartbeat.interval.ms 10000
> > > > >> > > > > > session.timeout.ms 6000
> > > >> > > > > >
> > > > >> > > > > > Still this did not help. The strange thing is that I observed the
> > > > >> > > > > > consumer writing logs saying "auto commit failed because poll() loop
> > > > >> > > > > > spent too much time processing records" even when there was no data
> > > > >> > > > > > in the partition to process. We have a poll timeout of 500 ms,
> > > > >> > > > > > specified as the argument to poll(). Initially I had set the same
> > > > >> > > > > > consumer group for all three topics' consumers. Then I specified
> > > > >> > > > > > different consumer groups for different topics' consumers. Even this
> > > > >> > > > > > is not helping.
> > > >> > > > > >
> > > > >> > > > > > I have searched the web, checked my code, and tried many
> > > > >> > > > > > combinations of configuration, but still no luck. Please help me.
> > > >> > > > > >
> > > >> > > > > > *Thanks & Regards,*
> > > >> > > > > >
> > > >> > > > > > *Shantanu Deshmukh*
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>
