I would like to add a little more context: the problem is not hard to reproduce.
If you are using

- auto commit
- heartbeat time = commit time
- more than one consumer

it seems that the consumer always fails to send the heartbeat. After changing the
heartbeat and commit intervals to different values, so that you won't normally
commit while you are trying to heartbeat, the consumer works fine.

On 26 March 2016 at 09:36, Zaiming Shi <zmst...@gmail.com> wrote:

> Hi Jason
>
> Thanks for looking into this.
>
> Created this: https://issues.apache.org/jira/browse/KAFKA-3470
> First time doing it. Not sure if I have followed any necessary convention
> there.
>
> To your question:
> No, this should not be a *significant* problem for any client as there
> are workarounds (commit less, increase session timeout etc.)
> However, since heartbeats and commits are sent in the same socket
> (especially when network latency is high), intensive commit requests
> (sync or async) may steal heartbeats' time slots.
> Fixing this should improve group stability for all clients.
>
> Regards
> -Zaiming
>
> On Sat, Mar 26, 2016 at 12:16 AM, Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hi Zaiming,
> >
> > Yeah, you're right. Changing coordinator won't cause a rebalance (it hasn't
> > been that way since we added group metadata persistence). I went back and
> > checked the code and we actually do not reset the heartbeat timer when a
> > commit is received. I'm not sure whether there's a good reason for that,
> > but nothing is coming to mind. At least when the group is stable, the
> > commit could be treated as an implicit heartbeat. Feel free to create a
> > JIRA and we can see what others think. Out of curiosity, is this a
> > significant problem for the Erlang client you're writing?
> >
> > -Jason
> >
> > On Fri, Mar 25, 2016 at 1:38 PM, Zaiming Shi <zmst...@gmail.com> wrote:
> >
> > > Hi Jason
> > >
> > > If I understand correctly, when coordinator is changed the consumer
> > > should get 'NotCoordinatorForGroup' exception not 'IllegalGenerationId'.
> > > Topic metadata change? Like the number of partitions changed?
> > > I was testing it in a pretty stable cluster, and it was reproduced several
> > > times.
> > > I had no such issue when we changed the session timeout to 3 minutes.
> > > --- does this rule out the topic metadata change?
> > >
> > > The logs are lost because I was running debug mode in our Erlang client to
> > > help debug this issue for my colleague who's using the new Java client.
> > > My colleague has observed very likely the same pattern as I described
> > > above.
> > > He is trying to get hold of a minimal setup for a reliable reproduction.
> > >
> > > I will also try to reproduce it in Erlang, and post here a (hopefully
> > > sensible) sequence of timestamped heartbeat and commit requests and
> > > responses.
> > >
> > > Will ask more questions if we have new findings.
> > >
> > > Regards
> > > -Zaiming
> > >
> > > On Fri, Mar 25, 2016 at 5:43 PM, Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hi Zaiming,
> > > >
> > > > It rules out the most likely cause of rebalance, but not the only one.
> > > > Rebalances can also be caused by a topic metadata change or a coordinator
> > > > change. Can you post some logs from the consumer around the time that the
> > > > unexpected rebalance occurred?
> > > >
> > > > -Jason
> > > >
> > > > On Fri, Mar 25, 2016 at 12:09 AM, Zaiming Shi <zmst...@gmail.com> wrote:
> > > >
> > > > > Hi Jason
> > > > >
> > > > > Thanks for the reply!
> > > > >
> > > > > Forgot to mention that we tried to test the simplest scenario, in which
> > > > > there was only one member in the group. I think that should rule out
> > > > > group rebalancing, right?
> > > > >
> > > > > On Thursday, March 24, 2016, Jason Gustafson <ja...@confluent.io> wrote:
> > > > >
> > > > > > Hi Zaiming,
> > > > > >
> > > > > > I think the problem is not that commit requests aren't considered as
> > > > > > effective as heartbeats (they are), but that you can't rejoin the group
> > > > > > using only commits/heartbeats. Every time the group rebalances, all
> > > > > > members must rejoin the group by sending a JoinGroup request. Once a
> > > > > > rebalance has begun (e.g. because a new consumer has been started), then
> > > > > > each member must send the JoinGroup before expiration of the session
> > > > > > timeout. If not, then they will be kicked out of the group even if they
> > > > > > are still sending heartbeats. Does that make sense?
> > > > > >
> > > > > > -Jason
> > > > > >
> > > > > > On Wed, Mar 23, 2016 at 10:03 AM, Zaiming Shi <zmst...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi there!
> > > > > > >
> > > > > > > We have noticed that when commit requests are sent intensively, we
> > > > > > > receive IllegalGenerationId.
> > > > > > > Here are the settings we had the problem with: session-timeout: 30 sec,
> > > > > > > heartbeat-rate: 3 sec.
> > > > > > > The problem was resolved by increasing the session timeout to 180 sec.
> > > > > > >
> > > > > > > So I suppose, due to whatever reason (either the client didn't send
> > > > > > > heartbeat, or the broker didn't process the heartbeats in time), the
> > > > > > > session was considered dead in group coordinator.
> > > > > > >
> > > > > > > My question is: why can't commit requests be taken as an indicator of
> > > > > > > the member being alive, and hence not kill the session?
> > > > > > >
> > > > > > > Regards
> > > > > > > -Zaiming
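
To make the original question concrete, here is a rough sketch of the
session-timer behaviour discussed in this thread. It is a toy model only, not
the broker's actual code: the class SessionTimerSketch, its methods, and the
commitCountsAsHeartbeat switch are all hypothetical names made up for
illustration. The 30 sec session timeout is taken from the settings in the
first mail; the one-commit-per-second rate is an assumption. The sketch
replays a member whose commits keep arriving while its heartbeats never get
through.

```java
// Toy model of a coordinator-side session timer. NOT Kafka's broker code;
// every name here is hypothetical and exists only for illustration.
final class SessionTimerSketch {
    private final long sessionTimeoutMs;
    // Hypothetical switch: treat a commit as an implicit heartbeat or not.
    private final boolean commitCountsAsHeartbeat;
    private long lastSeenMs;

    SessionTimerSketch(long sessionTimeoutMs, boolean commitCountsAsHeartbeat, long nowMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.commitCountsAsHeartbeat = commitCountsAsHeartbeat;
        this.lastSeenMs = nowMs;
    }

    // A heartbeat always resets the session timer.
    void onHeartbeat(long nowMs) {
        lastSeenMs = nowMs;
    }

    // A commit resets the timer only if it is treated as an implicit heartbeat.
    void onCommit(long nowMs) {
        if (commitCountsAsHeartbeat) {
            lastSeenMs = nowMs;
        }
    }

    // The session is dead once nothing has reset the timer within the timeout.
    boolean isExpired(long nowMs) {
        return nowMs - lastSeenMs > sessionTimeoutMs;
    }

    // Replays 60 seconds of a member that commits once per second (assumed
    // rate) but never manages to deliver a heartbeat.
    static boolean expiresWithCommitsOnly(boolean commitCountsAsHeartbeat) {
        SessionTimerSketch session =
                new SessionTimerSketch(30_000L, commitCountsAsHeartbeat, 0L);
        for (long t = 1_000L; t <= 60_000L; t += 1_000L) {
            if (session.isExpired(t)) {
                return true;
            }
            session.onCommit(t);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("commits ignored by timer: expired = "
                + expiresWithCommitsOnly(false));
        System.out.println("commits as implicit heartbeats: expired = "
                + expiresWithCommitsOnly(true));
    }
}
```

With commits ignored by the timer, the member is dropped even though it is
clearly alive and committing; with commits treated as implicit heartbeats it
stays in the group. Whether that is the right fix for a real coordinator is
what KAFKA-3470 is meant to settle.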