I am confused. If what you have mentioned is the case, then:

- Why would restarting the stream processes resolve the issue?
- Why do we get this infinite stream of exceptions only on some boxes in the cluster and not on all of them?
- We have tens of other consumers running just fine. We see this issue only in the Streams one.
On Tue, May 16, 2017 at 3:36 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> Sorry, I mis-read your email and confused it with another thread.
>
> As for your observed issue, it seems "broker-05:6667", which is the group
> coordinator for this streams app with app id (i.e. group id) "grp_id", is
> in an unstable state. Since the streams app cannot commit offsets anymore
> due to the group coordinator not being available, it cannot proceed but
> repeatedly re-discovers the coordinator.
>
> This is not generally an issue for streams, but for consumer group
> membership management. In practice you need to make sure that the offsets
> topic is replicated (I think by default it is 3 replicas) so that whenever
> the leader of a certain offsets topic partition, hence the group
> coordinator, fails, another broker can take over and any consumer group
> corresponding to that offsets topic partition won't be blocked.
>
> Guozhang
>
> On Mon, May 15, 2017 at 7:33 PM, Mahendra Kariya <mahendra.kar...@go-jek.com> wrote:
>
> > Thanks for the reply Guozhang! But I think we are talking about 2 different
> > issues here. KAFKA-5167 is for LockException. We face that issue
> > intermittently, but not a lot.
> >
> > There is also another issue where a particular broker is marked as dead
> > for a group id and the Streams process never recovers from this exception.
> >
> > On Mon, May 15, 2017 at 11:28 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > > I'm wondering if it is possibly due to KAFKA-5167? In that case, the
> > > "other thread" will keep retrying to grab the lock.
> > >
> > > Guozhang
> > >
> > > On Sat, May 13, 2017 at 7:30 PM, Mahendra Kariya <mahendra.kar...@go-jek.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > There is no missing data. But the INFO level logs are infinite and the
> > > > streams practically stop. For the messages that I posted, we got these
> > > > INFO logs for around 20 mins. After that we got an alert about no data
> > > > being produced in the sink topic and we had to restart the streams
> > > > processes.
> > > >
> > > > On Sun, May 14, 2017 at 1:01 AM, Matthias J. Sax <matth...@confluent.io> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I just dug a little bit. The messages are logged at INFO level and thus
> > > > > should not be a problem if they go away by themselves after some time.
> > > > > Compare:
> > > > > https://groups.google.com/forum/#!topic/confluent-platform/A14dkPlDlv4
> > > > >
> > > > > Do you still see missing data?
> > > > >
> > > > > -Matthias
> > > > >
> > > > > On 5/11/17 2:39 AM, Mahendra Kariya wrote:
> > > > > > Hi Matthias,
> > > > > >
> > > > > > We faced the issue again. The logs are below.
> > > > > >
> > > > > > 16:13:16.527 [StreamThread-7] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > > > > > 16:13:16.543 [StreamThread-3] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > > > > > 16:13:16.543 [StreamThread-3] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > > > > > 16:13:16.547 [StreamThread-6] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > > > > > 16:13:16.547 [StreamThread-6] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > > > > > 16:13:16.551 [StreamThread-1] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > > > > > 16:13:16.551 [StreamThread-1] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > > > > > 16:13:16.572 [StreamThread-4] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > > > > > 16:13:16.572 [StreamThread-4] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > > > > > 16:13:16.573 [StreamThread-2] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > > > > >
> > > > > > On Tue, May 9, 2017 at 3:40 AM, Matthias J. Sax <matth...@confluent.io> wrote:
> > > > > >
> > > > > >> Great! Glad 0.10.2.1 fixes it for you!
> > > > > >>
> > > > > >> -Matthias
> > > > > >>
> > > > > >> On 5/7/17 8:57 PM, Mahendra Kariya wrote:
> > > > > >>> Upgrading to 0.10.2.1 seems to have fixed the issue.
> > > > > >>>
> > > > > >>> Until now, we were looking at random 1 hour data to analyse the issue.
> > > > > >>> Over the weekend, we have written a simple test that will continuously
> > > > > >>> check for inconsistencies in real time and report if there is any issue.
> > > > > >>>
> > > > > >>> No issues have been reported for the last 24 hours. Will update this
> > > > > >>> thread if we find any issue.
> > > > > >>>
> > > > > >>> Thanks for all the support!
> > > > > >>>
> > > > > >>> On Fri, May 5, 2017 at 3:55 AM, Matthias J. Sax <matth...@confluent.io> wrote:
> > > > > >>>
> > > > > >>>> About
> > > > > >>>>
> > > > > >>>>> 07:44:08.493 [StreamThread-10] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 for group group-2.
> > > > > >>>>
> > > > > >>>> Please upgrade to Streams 0.10.2.1 -- we fixed a couple of bugs and I
> > > > > >>>> would assume this issue is fixed, too. If not, please report back.
> > > > > >>>>
> > > > > >>>>> Another question that I have is, is there a way for us to detect how
> > > > > >>>>> many messages have come out of order? And if possible, what is the
> > > > > >>>>> delay?
> > > > > >>>>
> > > > > >>>> There is no metric or API for this. What you could do though is to use
> > > > > >>>> #transform() that only forwards each record and, as a side task,
> > > > > >>>> extracts the timestamp via `context#timestamp()` and does some
> > > > > >>>> bookkeeping to compute whether it is out-of-order and what the delay was.
> > > > > >>>>
> > > > > >>>>>>> - same for .mapValues()
> > > > > >>>>>>
> > > > > >>>>>> I am not sure how to check this.
> > > > > >>>>
> > > > > >>>> The same way as you do for filter()?
> > > > > >>>>
> > > > > >>>> -Matthias
> > > > > >>>>
> > > > > >>>> On 5/4/17 10:29 AM, Mahendra Kariya wrote:
> > > > > >>>>> Hi Matthias,
> > > > > >>>>>
> > > > > >>>>> Please find the answers below.
> > > > > >>>>>
> > > > > >>>>>> I would recommend to double check the following:
> > > > > >>>>>>
> > > > > >>>>>> - can you confirm that the filter does not remove all data for
> > > > > >>>>>> those time periods?
> > > > > >>>>>
> > > > > >>>>> Filter does not remove all data. There is a lot of data coming in
> > > > > >>>>> even after the filter stage.
> > > > > >>>>>
> > > > > >>>>>> - I would also check input for your AggregatorFunction() -- does
> > > > > >>>>>> it receive everything?
> > > > > >>>>>
> > > > > >>>>> Yes. The aggregate function seems to be receiving everything.
> > > > > >>>>>
> > > > > >>>>>> - same for .mapValues()
> > > > > >>>>>
> > > > > >>>>> I am not sure how to check this.
> > >
> > > --
> > > -- Guozhang
>
> --
> -- Guozhang
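A small follow-up on Guozhang's point about the offsets topic: the replication of __consumer_offsets can be checked programmatically. The sketch below is not from this thread; it assumes a client version that ships the Java AdminClient (0.11.0+, i.e. newer than the Streams versions discussed above) and a made-up bootstrap server broker-01:6667. It only prints the replica and ISR counts of each partition of the offsets topic, so you can confirm that a failed group coordinator has somewhere to fail over to.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class OffsetsTopicCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder address; point this at your own cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-01:6667");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin
                    .describeTopics(Collections.singletonList("__consumer_offsets"))
                    .all().get().get("__consumer_offsets");
            // Each partition's leader acts as group coordinator for the groups hashed
            // to it; the replicas are what allow another broker to take over on failure.
            desc.partitions().forEach(p ->
                    System.out.printf("partition %d leader=%s replicas=%d isr=%d%n",
                            p.partition(),
                            p.leader() == null ? "none" : p.leader().idString(),
                            p.replicas().size(),
                            p.isr().size()));
        }
    }
}
```

If the replica count is 1 (older broker defaults), a single broker failure leaves the corresponding consumer groups without a coordinator, which matches the behaviour described above.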
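And a rough sketch of the #transform() bookkeeping Matthias suggests for detecting out-of-order records, written against the 0.10.2-era Transformer interface. The class name, the counters, and the logging in close() are illustrative only, not a standard API; each stream task gets its own instance, so the counts are per task rather than global.

```java
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

// Illustrative sketch (not part of Kafka): forwards every record unchanged and,
// as a side task, tracks how many records arrived out of order and the max delay.
public class OutOfOrderTracker<K, V> implements Transformer<K, V, KeyValue<K, V>> {

    private ProcessorContext context;
    private long maxTimestampSeen = Long.MIN_VALUE;
    private long outOfOrderCount = 0L;
    private long maxDelayMs = 0L;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<K, V> transform(K key, V value) {
        long ts = context.timestamp();
        if (ts < maxTimestampSeen) {
            // Smaller timestamp than one we already saw: out of order; the gap is the delay.
            outOfOrderCount++;
            maxDelayMs = Math.max(maxDelayMs, maxTimestampSeen - ts);
        } else {
            maxTimestampSeen = ts;
        }
        // Forward the record unchanged; the bookkeeping is purely a side effect.
        return KeyValue.pair(key, value);
    }

    @Override
    public KeyValue<K, V> punctuate(long timestamp) {
        return null; // not used; required by the 0.10.x Transformer interface
    }

    @Override
    public void close() {
        System.out.printf("out-of-order records: %d, max delay: %d ms%n",
                outOfOrderCount, maxDelayMs);
    }
}
```

It would be wired in just before the aggregation, e.g. `stream.transform(() -> new OutOfOrderTracker<>())`, so the rest of the topology is unaffected.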
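Finally, on the question of how to check that .mapValues() (or any other stage) still sees data, the same way the filter was checked: one option is a pass-through stage that counts records as a side effect. StageCounter, the label argument, and the 10,000-record log interval are invented for this sketch; a shared AtomicLong is used because the same lambda instance can be reused across stream threads in one JVM.

```java
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.streams.kstream.KStream;

// Hypothetical debugging helper: wraps a KStream with a counting pass-through so you
// can verify that records are still flowing after a given stage (e.g. after mapValues()).
public class StageCounter {
    public static <K, V> KStream<K, V> count(KStream<K, V> stream, String label) {
        AtomicLong seen = new AtomicLong();
        // foreach() would terminate the stream; a pass-through mapValues keeps it
        // flowing while counting as a side effect.
        return stream.mapValues(value -> {
            long n = seen.incrementAndGet();
            if (n % 10_000 == 0) {
                System.out.println(label + ": " + n + " records so far");
            }
            return value;
        });
    }
}
```

It could be dropped in right after the stage under suspicion, e.g. `StageCounter.count(filtered.mapValues(...), "after mapValues")`, and removed once the check is done.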