Hi Guozhang,

OK, I spent some time understanding a bit better how Kafka uses ZooKeeper
and how sessions are handled, and it seems that the change you proposed
should do the job. Thanks :-)
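
For the record, this is what I'm going to try first (the exact values are
a guess and still need tuning against our latency spikes):

  # consumer config
  zookeeper.session.timeout.ms=30000
  zookeeper.connection.timeout.ms=30000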

But I still think that an (optional?) automatic restart of a consumer could
be a good idea! ;-)
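
For illustration, this is roughly the fallback I have in mind - a
hypothetical, untested sketch against the 0.8 high-level consumer API
(the class name and the retry count are mine, not anything Kafka provides):

import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.common.ConsumerRebalanceFailedException;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class RestartingConsumer {

    // Report the rebalance failure first (so the underlying issue stays
    // visible), then tear the connector down and recreate it a bounded
    // number of times before giving up for good.
    public static Map<String, List<KafkaStream<byte[], byte[]>>> connect(
            Properties props, Map<String, Integer> topicCountMap,
            int maxRestarts) {
        for (int attempt = 1; ; attempt++) {
            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            try {
                // createMessageStreams triggers the initial rebalance
                return connector.createMessageStreams(topicCountMap);
            } catch (ConsumerRebalanceFailedException e) {
                System.err.println("Rebalance failed (attempt " + attempt
                        + "): " + e);  // report / alert here
                connector.shutdown();  // drop the broken connector...
                if (attempt >= maxRestarts) {
                    throw e;           // ...and give up eventually
                }
            }
        }
    }
}

Having something like this built into the consumer as an opt-in would save
everyone reimplementing it in application code.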

M.



Kind regards,
Michał Michalski,
michal.michal...@boxever.com


On 11 July 2014 16:18, Guozhang Wang <wangg...@gmail.com> wrote:

> Hi Michal,
>
> In your case you could try increasing the ZooKeeper session timeout value
> on the consumer side (the default is 6 sec) and see if this is sufficient
> to cover the latency jitter.
>
> Guozhang
>
>
> On Fri, Jul 11, 2014 at 5:25 AM, Michal Michalski <
> michal.michal...@boxever.com> wrote:
>
> > Hey Guozhang,
> >
> > Thanks for the reply. I get your point on "hiding" some issues, but I'd
> > prefer to separate recovery from reporting a failure. Also, I think that
> > if a simple restart is a possible solution, it shouldn't require
> > implementing it separately or, what's even worse, manual intervention.
> > Maybe I'll describe my problem then, to show you my point of view:
> >
> > ZK latency spiked for a few seconds, making ZK effectively dead from the
> > consumers' point of view. Then they all reconnected which, as I
> > understand it, caused rebalancing. Some consumer groups succeeded, but
> > then another latency spike happened and - as we suspect - it caused
> > rebalancing to fail, because creation of a ZK node failed at some point.
> > Ideally, I'd like to get notified about that problem (rebalancing failed
> > after X retries etc.), so I know there is an issue and can investigate
> > it, but then I'd like the Kafka consumer (or my app) to fall back to a
> > restart, which could *possibly* make the consumer recover. If not -
> > that's my problem then ;-)
> >
> > In our case it was enough to restart the app to get the consumer working
> > again, but - as we didn't know about that behaviour before and weren't
> > prepared for it - it required manual intervention (on a Friday night,
> > which made it even more painful ;> ) which, we believe, wasn't necessary
> > in this case and could have been handled automatically.
> >
> > M.
> >
> >
> >
> > Kind regards,
> > Michał Michalski,
> > michal.michal...@boxever.com
> >
> >
> > On 10 July 2014 23:43, Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > > Hi Michal,
> > >
> > > The rebalance will only be triggered on consumer membership or
> > > topic/partition changes. Once triggered, it will try to finish the
> > > rebalance at most rebalance.max.retries times, i.e. if it fails it
> > > will wait for rebalance.backoff.ms and then try again, until the
> > > number of retries is exhausted. When that happens, an exception is
> > > thrown and the consumer may be left in a bad state.
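> > >
> > > With the default values you mentioned (4 retries every 2 seconds) that
> > > is only about an 8-second window in total, i.e.:
> > >
> > >   rebalance.max.retries=4
> > >   rebalance.backoff.ms=2000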
> > >
> > > The reason we did not implement automatic restart upon rebalance
> > > failures is that it may actually "hide" some issues in the system that
> > > caused the rebalance failure in the first place. The general design is
> > > that if some exceptions/errors are unexpected, like rebalance
> > > failures, we let them possibly halt/kill the instance rather than
> > > automatically restarting and letting it go.
> > >
> > > Guozhang
> > >
> > >
> > >
> > >
> > > On Thu, Jul 10, 2014 at 2:24 AM, Michal Michalski <
> > > michal.michal...@boxever.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > Just wondering - is there any reason why rebalance.max.retries is 4
> > > > by default? Is there any good reason why I shouldn't expect my
> > > > consumers to keep trying to rebalance for minutes (e.g. 30 retries
> > > > every 6 seconds) rather than seconds (4 retries every 2 seconds by
> > > > default)?
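> > > >
> > > > In other words, why couldn't I just set something like this (values
> > > > picked only for illustration; 30 retries * 6 s gives a ~3 minute
> > > > window):
> > > >
> > > >   rebalance.max.retries=30
> > > >   rebalance.backoff.ms=6000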
> > > >
> > > > Also, if my consumer fails to rebalance because of a NoNodeException
> > > > (org.apache.zookeeper.KeeperException$NoNodeException:
> > > > KeeperErrorCode = NoNode for
> > > > /consumers/is-entity-modified-document-group/ids/<something>),
> > > > wouldn't it make sense for Kafka to restart it automatically once it
> > > > has "used up" all the retry attempts? Or to recreate the nonexistent
> > > > ZK node as, I believe, will happen on a consumer restart?
> > > >
> > > > I'm asking because these kinds of errors seem to be "recoverable"
> > > > ones, but - if I understand it correctly - with the current design
> > > > they require implementing additional mechanisms or manual
> > > > intervention.
> > > >
> > > >
> > > > Kind regards,
> > > > Michał
> > > >
> > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
>
> --
> -- Guozhang
>
