I would like to add a little more context: the problem is not hard to
reproduce.

If you are using

   - auto commit
   - heartbeat time = commit time
   - more than one consumer

then the consumer always seems to fail to send the heartbeat. If you set the
heartbeat and commit intervals to different values, so that you won't
normally commit while you are trying to heartbeat, the consumer works fine.
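
To illustrate why offsetting the two intervals helps, here is a minimal
sketch in Python. It is only a simplified model, assuming both timers start
at the same moment and fire at exact multiples of their interval; it is not
how the Kafka client actually schedules requests. The function names and the
5-second commit interval are made up for the example, while the 3-second
heartbeat and 30-second window are the values reported at the start of this
thread.

```python
from math import gcd
from typing import List


def coinciding_ticks(heartbeat_ms: int, commit_ms: int, window_ms: int) -> List[int]:
    """Times (in ms) within the window at which both timers fire together.

    Hypothetical helper: assumes ideal timers that start at t=0 and never drift.
    """
    if heartbeat_ms <= 0 or commit_ms <= 0:
        raise ValueError("intervals must be positive")
    # Two timers that start together fire together at every common multiple.
    lcm = heartbeat_ms * commit_ms // gcd(heartbeat_ms, commit_ms)
    return list(range(lcm, window_ms + 1, lcm))


def overlap_ratio(heartbeat_ms: int, commit_ms: int, window_ms: int) -> float:
    """Fraction of heartbeats in the window that coincide with a commit."""
    heartbeats = window_ms // heartbeat_ms
    if heartbeats == 0:
        return 0.0
    return len(coinciding_ticks(heartbeat_ms, commit_ms, window_ms)) / heartbeats


if __name__ == "__main__":
    window_ms = 30_000  # session timeout from the original report
    # Equal intervals: every heartbeat collides with a commit.
    print(overlap_ratio(3_000, 3_000, window_ms))  # 1.0
    # Different intervals (5 s is an assumed value): far fewer collisions.
    print(overlap_ratio(3_000, 5_000, window_ms))  # 0.2
```

In this model, equal intervals put every heartbeat on a commit tick, while
3 s and 5 s only line up every 15 s, so only one heartbeat in five has to
compete with a commit.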

On 26 March 2016 at 09:36, Zaiming Shi <zmst...@gmail.com> wrote:

> Hi Jason
>
> Thanks for looking into this.
>
> Created this: https://issues.apache.org/jira/browse/KAFKA-3470
> First time doing this. Not sure if I have followed the necessary
> conventions there.
>
> To your question:
> No, this should not be a *significant* problem for any client as there
> are workarounds (commit less, increase session timeout etc.)
> However, since heartbeats and commits are sent over the same socket,
> intensive commit requests (sync or async) may steal heartbeats' time
> slots, especially when network latency is high.
> Fixing this should improve group stability for all clients.
>
> Regards
> -Zaiming
>
> On Sat, Mar 26, 2016 at 12:16 AM, Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hi Zaiming,
> >
> > Yeah, you're right. Changing coordinator won't cause a rebalance (it hasn't
> > been that way since we added group metadata persistence). I went back and
> > checked the code and we actually do not reset the heartbeat timer when a
> > commit is received. I'm not sure whether there's a good reason for that,
> > but nothing is coming to mind. At least when the group is stable, the
> > commit could be treated as an implicit heartbeat. Feel free to create a
> > JIRA and we can see what others think. Out of curiosity, is this a
> > significant problem for the Erlang client you're writing?
> >
> > -Jason
> >
> > On Fri, Mar 25, 2016 at 1:38 PM, Zaiming Shi <zmst...@gmail.com> wrote:
> >
> > > Hi Jason
> > >
> > > If I understand correctly, when the coordinator changes, the consumer
> > > should get a 'NotCoordinatorForGroup' exception, not 'IllegalGenerationId'.
> > > Topic metadata change? Like the number of partitions changing?
> > > I was testing in a pretty stable cluster, and it was reproduced several
> > > times. I had no such issue when we changed the session timeout to 3 minutes.
> > > Does this rule out the topic metadata change?
> > >
> > > The logs are lost because I was running debug mode in our Erlang client to
> > > help debug this issue for my colleague, who's using the new Java client.
> > > My colleague has very likely observed the same pattern as I described
> > > above.
> > > He is trying to get hold of a minimal setup for a reliable reproduction.
> > >
> > > I will also try to reproduce it in Erlang, and post here a (hopefully
> > > sensible) sequence of timestamped heartbeat and commit requests and
> > > responses.
> > >
> > > Will ask more questions if we have new findings.
> > >
> > > Regards
> > > -Zaiming
> > >
> > >
> > >
> > > On Fri, Mar 25, 2016 at 5:43 PM, Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hi Zaiming,
> > > >
> > > > It rules out the most likely cause of rebalance, but not the only one.
> > > > Rebalances can also be caused by a topic metadata change or a coordinator
> > > > change. Can you post some logs from the consumer around the time that the
> > > > unexpected rebalance occurred?
> > > >
> > > > -Jason
> > > >
> > > > On Fri, Mar 25, 2016 at 12:09 AM, Zaiming Shi <zmst...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Jason
> > > > >
> > > > > Thanks for the reply!
> > > > >
> > > > > Forgot to mention that we tried to test the simplest scenario, in which
> > > > > there was only one member in the group. I think that should rule out
> > > > > group rebalancing, right?
> > > > >
> > > > > On Thursday, March 24, 2016, Jason Gustafson <ja...@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hi Zaiming,
> > > > > >
> > > > > > I think the problem is not that commit requests aren't considered as
> > > > > > effective as heartbeats (they are), but that you can't rejoin the group
> > > > > > using only commits/heartbeats. Every time the group rebalances, all
> > > > > > members must rejoin the group by sending a JoinGroup request. Once a
> > > > > > rebalance has begun (e.g. because a new consumer has been started), then
> > > > > > each member must send the JoinGroup before expiration of the session
> > > > > > timeout. If not, then they will be kicked out of the group even if they
> > > > > > are still sending heartbeats. Does that make sense?
> > > > > >
> > > > > > -Jason
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Mar 23, 2016 at 10:03 AM, Zaiming Shi <zmst...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi there!
> > > > > > >
> > > > > > > We have noticed that when commit requests are sent intensively, we
> > > > > > > receive IllegalGenerationId.
> > > > > > > Here are the settings we had problems with: session-timeout: 30 sec,
> > > > > > > heartbeat-rate: 3 sec.
> > > > > > > Problem resolved by increasing the session timeout to 180 sec.
> > > > > > >
> > > > > > > So I suppose, due to whatever reason (either the client didn't send
> > > > > > > heartbeat, or the broker didn't process the heartbeats in time), the
> > > > > > > session was considered dead in group coordinator.
> > > > > > >
> > > > > > > My question is: why can't commit requests be taken as an indicator of
> > > > > > > the member being alive, and hence not kill the session?
> > > > > > >
> > > > > > > Regards
> > > > > > > -Zaiming
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
