Do you think you can reproduce this easily?

Thanks,

Jun


On Fri, Jan 10, 2014 at 11:33 AM, Jason Rosenberg <j...@squareup.com> wrote:

> well, not currently, as we don't have multiple partitions for the
> topics.....but yes, I understand that would help too....
>
> but, we are using this multiple consumers within a process approach in
> general with much success so far......just was curious about this ERROR I
> was seeing :)
>
>
> On Fri, Jan 10, 2014 at 11:06 AM, Jun Rao <jun...@gmail.com> wrote:
>
> > Could you increase parallelism on the consumers?
> >
> > Thanks,
> >
> > Jun
> >
> >
> > > On Thu, Jan 9, 2014 at 1:22 PM, Jason Rosenberg <j...@squareup.com> wrote:
> >
> > > The consumption rate is a little better after the refactoring.  The main
> > > issue though, was that we had a mismatch between large and small topics.  A
> > > large topic can lag, and adversely affect consumption of other topics, so
> > > this is an attempt to isolate topic filtering, and better balance the
> > > consumers for the different topics.
> > >
> > > So, it's definitely working on that score.
> > >
> > > The topic that was lagging (and getting OffsetOutOfRangeExceptions) was
> > > doing that before and after the refactor (and after we started also seeing
> > > the ERROR logging).  But consumption of all other topics is working better
> > > now (almost no lag at all).
> > >
> > > I'm also setting the client.id for each consumer in the process, so that I
> > > can see the individual metrics per consumer.
> > >
> > > Jason
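A minimal sketch, assuming the Kafka 0.8 high-level consumer Java API, of the per-connector setup being described; the group id, client id, and ZooKeeper address are made-up placeholders. Giving each connector its own client.id is what keeps the per-consumer metrics distinguishable.

    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class PerConnectorConfig {
        // Every connector in the process shares the same group.id, but each
        // gets a distinct client.id so its metrics are reported separately.
        public static ConsumerConnector newConnector(String clientId) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zookeeper:2181"); // placeholder address
            props.put("group.id", "myconsumergroup");         // same for all connectors
            props.put("client.id", clientId);                 // e.g. "myconsumerapp-large-topics"
            return Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        }
    }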
> > >
> > >
> > > On Thu, Jan 9, 2014 at 1:00 PM, Jun Rao <jun...@gmail.com> wrote:
> > >
> > > > Does the consumption rate in the client (msg/sec) change significantly after
> > > > the refactoring?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > >
> > > > On Wed, Jan 8, 2014 at 10:44 AM, Jason Rosenberg <j...@squareup.com> wrote:
> > > >
> > > > > Yes, it's happening continuously, at the moment (although I'm expecting the
> > > > > consumer to catch up soon)....
> > > > >
> > > > > It seemed to start happening after I refactored the consumer app to use
> > > > > multiple consumer connectors in the same process (each one has a separate
> > > > > topic filter, so should be no overlap between them).  All using the same
> > > > > consumer group.
> > > > >
> > > > > Could it be a thread safety issue in the ZookeeperConsumerConnector (seems
> > > > > unlikely)?
> > > > >
> > > > > Jason
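For illustration, a rough sketch of that layout under the same assumptions (0.8 high-level consumer Java API): one process, several connectors in the same consumer group, each with its own non-overlapping topic filter. The topic patterns, stream counts, and addresses are invented.

    import java.util.List;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.KafkaStream;
    import kafka.consumer.Whitelist;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class MultiConnectorApp {
        public static void main(String[] args) {
            // Two connectors in the same process and consumer group, each
            // filtering on a disjoint set of topics.
            ConsumerConnector large = newConnector("mygroup", "large-topics-consumer");
            ConsumerConnector small = newConnector("mygroup", "small-topics-consumer");

            List<KafkaStream<byte[], byte[]>> largeStreams =
                large.createMessageStreamsByFilter(new Whitelist("large\\..*"), 1);
            List<KafkaStream<byte[], byte[]>> smallStreams =
                small.createMessageStreamsByFilter(new Whitelist("small\\..*"), 4);
            // ... hand each KafkaStream to its own consumer thread ...
        }

        private static ConsumerConnector newConnector(String groupId, String clientId) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zookeeper:2181"); // placeholder address
            props.put("group.id", groupId);
            props.put("client.id", clientId);
            return Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        }
    }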
> > > > >
> > > > >
> > > > > On Wed, Jan 8, 2014 at 1:04 AM, Jun Rao <jun...@gmail.com> wrote:
> > > > >
> > > > > > Normally, if the consumer can't keep up, you should just see the
> > > > > > OffsetOutOfRangeException warning. The offset mismatch error should never
> > > > > > happen. It could be that OffsetOutOfRangeException exposed a bug. Do you
> > > > > > think you can reproduce this easily?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > >
> > > > > > On Tue, Jan 7, 2014 at 9:29 PM, Jason Rosenberg <j...@squareup.com> wrote:
> > > > > >
> > > > > > > Jun,
> > > > > > >
> > > > > > > I'm not sure I understand your question, wrt produced data?
> > > > > > >
> > > > > > > But yes, in general, I believe the consumer is not keeping up with the
> > > > > > > broker's deleting the data.  So it's trying to fetch the next batch of
> > > > > > > data, but its last offset is no longer there, etc.  So that's the reason
> > > > > > > for the WARN message, in the fetcher thread.
> > > > > > >
> > > > > > > I'm just not sure I understand then why we don't always see the
> > > > > > > ConsumerIterator error also, because won't there always be missing data
> > > > > > > detected there?  Why sometimes and not always?  What's the difference?
> > > > > > >
> > > > > > > Jason
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Jan 8, 2014 at 12:07 AM, Jun Rao <jun...@gmail.com> wrote:
> > > > > > >
> > > > > > > > The WARN and ERROR may not be completely correlated. Could it be that
> > > > > > > > the consumer is slow and couldn't keep up with the produced data?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Jan 7, 2014 at 6:47 PM, Jason Rosenberg <j...@squareup.com> wrote:
> > > > > > > >
> > > > > > > > > So, sometimes I just get the WARN from the ConsumerFetcherThread (as
> > > > > > > > > previously noted, above), e.g.:
> > > > > > > > >
> > > > > > > > > 2014-01-08 02:31:47,394  WARN [ConsumerFetcherThread-myconsumerapp-11]
> > > > > > > > > consumer.ConsumerFetcherThread - [ConsumerFetcherThread-myconsumerapp-11],
> > > > > > > > > Current offset 16163904970 for partition [mypartition,0] out of range;
> > > > > > > > > reset offset to 16175326044
> > > > > > > > >
> > > > > > > > > More recently, I see these in the following log line (not sure why I
> > > > > > > > > didn't see it previously), coming from the ConsumerIterator:
> > > > > > > > >
> > > > > > > > > 2014-01-08 02:31:47,681 ERROR [myconsumerthread-0]
> > > > > > > > > consumer.ConsumerIterator - consumed offset: 16163904970 doesn't match
> > > > > > > > > fetch offset: 16175326044 for mytopic:0: fetched offset = 16175330598:
> > > > > > > > > consumed offset = 16163904970; Consumer may lose data
> > > > > > > > >
> > > > > > > > > Why would I not see this second ERROR every time there's a
> > > > > > > > > corresponding WARN on the FetcherThread for an offset reset?
> > > > > > > > >
> > > > > > > > > Should I only be concerned about possible lost data if I see the
> > > > > > > > > second ERROR log line?
> > > > > > > > >
> > > > > > > > > Jason
> > > > > > > > >
> > > > > > > > > On Tue, Dec 24, 2013 at 3:49 PM, Jason Rosenberg <j...@squareup.com> wrote:
> > > > > > > > > > But I assume this is not something you'd normally want to log
> > > > > > > > > > (every incoming producer request?).  Maybe just for debugging?
> > > > > > > > > > Or is it only for consumer fetch requests?
> > > > > > > > > >
> > > > > > > > > > On Tue, Dec 24, 2013 at 12:50 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> > > > > > > > > >> TRACE is lower than INFO so INFO level request logging would also
> > > > > > > > > >> be recorded.
> > > > > > > > > >>
> > > > > > > > > >> You can check for "Completed XXX request" in the log files to
> > > > > > > > > >> check the request info with the correlation id.
> > > > > > > > > >>
> > > > > > > > > >> Guozhang
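For reference, a sketch of the log4j.properties entries involved, modeled on the stock broker config shipped with 0.8 (the file path is a placeholder). The request logger defaults to WARN, so raising it to TRACE is what makes the per-request lines appear.

    # Appender for the request log (as in the stock config/log4j.properties)
    log4j.appender.requestAppender=org.apache.log4j.DailyRollingFileAppender
    log4j.appender.requestAppender.File=/var/kafka/logs/kafka-request.log
    log4j.appender.requestAppender.layout=org.apache.log4j.PatternLayout
    log4j.appender.requestAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

    # Raise this from the default WARN to TRACE to log every request,
    # including its correlation id and client id:
    log4j.logger.kafka.request.logger=TRACE, requestAppender
    log4j.additivity.kafka.request.logger=false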
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> On Mon, Dec 23, 2013 at 10:46 PM, Jason Rosenberg <j...@squareup.com> wrote:
> > > > > > > > > >>
> > > > > > > > > >>> Hmmm, it looks like I'm enabling all logging at INFO, and the
> > > > > > > > > >>> request logging is only done at TRACE (why is that?).
> > > > > > > > > >>>
> > > > > > > > > >>> I suppose one wouldn't normally want to see request logs, so by
> > > > > > > > > >>> default, they aren't enabled?
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>> On Mon, Dec 23, 2013 at 11:46 PM, Jun Rao <jun...@gmail.com> wrote:
> > > > > > > > > >>>
> > > > > > > > > >>> > Did you enable the request log? It logs the ip of every request.
> > > > > > > > > >>> >
> > > > > > > > > >>> > Thanks,
> > > > > > > > > >>> >
> > > > > > > > > >>> > Jun
> > > > > > > > > >>> >
> > > > > > > > > >>> >
> > > > > > > > > >>> > On Mon, Dec 23, 2013 at 3:52 PM, Jason Rosenberg <j...@squareup.com> wrote:
> > > > > > > > > >>> >
> > > > > > > > > >>> > > Hi Guozhang,
> > > > > > > > > >>> > >
> > > > > > > > > >>> > > I'm not sure I understand your first answer.  I don't see
> > > > > > > > > >>> > > anything regarding the correlation id, elsewhere in the broker
> > > > > > > > > >>> > > logs.....They only show up in those ERROR messages....
> > > > > > > > > >>> > >
> > > > > > > > > >>> > > I do see correlation id's in clients, but not on the broker.....
> > > > > > > > > >>> > >
> > > > > > > > > >>> > > Jason
> > > > > > > > > >>> > >
> > > > > > > > > >>> > >
> > > > > > > > > >>> > > On Mon, Dec 23, 2013 at 6:46 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> > > > > > > > > >>> > >
> > > > > > > > > >>> > > > Jason,
> > > > > > > > > >>> > > >
> > > > > > > > > >>> > > > You can search the correlation id in the public access log
> > > > > > > > > >>> > > > on the servers to get the consumer information.
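As a concrete, hypothetical example of that search once request logging is enabled on the brokers (the log path depends on your log4j.properties, and grepping a bare number will also match offsets, so expect some noise):

    # On each broker, look up which client issued correlation id 130341
    grep 130341 /var/kafka/logs/kafka-request.log*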
> > > > > > > > > >>> > > >
> > > > > > > > > >>> > > > As for logging, I agree that we should use the same level
> > > > > > > > > >>> > > > on both sides.  Could you file a jira for this?
> > > > > > > > > >>> > > >
> > > > > > > > > >>> > > > Guozhang
> > > > > > > > > >>> > > >
> > > > > > > > > >>> > > >
> > > > > > > > > >>> > > > On Mon, Dec 23, 2013 at 3:09 PM, Jason Rosenberg <j...@squareup.com> wrote:
> > > > > > > > > >>> > > >
> > > > > > > > > >>> > > > > In our broker logs, we occasionally see errors like this:
> > > > > > > > > >>> > > > >
> > > > > > > > > >>> > > > > 2013-12-23 05:02:08,456 ERROR [kafka-request-handler-2]
> > > > > > > > > >>> > > > > server.KafkaApis - [KafkaApi-45] Error when processing
> > > > > > > > > >>> > > > > fetch request for partition [mytopic,0] offset 204243601
> > > > > > > > > >>> > > > > from consumer with correlation id 130341
> > > > > > > > > >>> > > > > kafka.common.OffsetOutOfRangeException: Request for offset
> > > > > > > > > >>> > > > > 204243601 but we only have log segments in the range
> > > > > > > > > >>> > > > > 204343397 to 207423640.
> > > > > > > > > >>> > > > >
> > > > > > > > > >>> > > > > I assume this means there's a consumer that has fallen
> > > > > > > > > >>> > > > > behind consuming messages, and the log retention policy has
> > > > > > > > > >>> > > > > removed messages before they could be consumed by the
> > > > > > > > > >>> > > > > consumer.
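For context, a sketch of the settings that interact here; the values are illustrative, not taken from this cluster. The broker's retention settings bound how far back a lagging consumer can reach, and the consumer's auto.offset.reset decides where the fetcher jumps when it requests an offset that has already been deleted.

    # Broker (server.properties): how long log segments are kept before deletion
    log.retention.hours=168
    # log.retention.bytes=...      # optional size-based retention

    # High-level consumer properties: where to restart on OffsetOutOfRangeException
    #   smallest = oldest offset still on the broker, largest = newest
    auto.offset.reset=smallest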
> > > > > > > > > >>> > > > >
> > > > > > > > > >>> > > > > However, I'm not 100% sure which consumer it is, and it
> > > > > > > > > >>> > > > > looks like the only info we have is the correlation id of
> > > > > > > > > >>> > > > > the consumer, e.g.:
> > > > > > > > > >>> > > > >
> > > > > > > > > >>> > > > > "from consumer with correlation id 130341"
> > > > > > > > > >>> > > > >
> > > > > > > > > >>> > > > > Is there a way to know which consumer this refers to?  It
> > > > > > > > > >>> > > > > seems there are far more correlation id's than there are
> > > > > > > > > >>> > > > > consumers.  Would it be possible to provide a bit more
> > > > > > > > > >>> > > > > descriptive error message here, so we can immediately know
> > > > > > > > > >>> > > > > which consumer is falling behind?
> > > > > > > > > >>> > > > >
> > > > > > > > > >>> > > > > We do see a corresponding entry in the consumer logs too:
> > > > > > > > > >>> > > > >
> > > > > > > > > >>> > > > > 2013-12-23 05:02:08,797  WARN
> > > > > > > > > >>> > > > > [ConsumerFetcherThread-myconsumergroup-1387353494862-7aa0c61d-0-45]
> > > > > > > > > >>> > > > > consumer.ConsumerFetcherThread -
> > > > > > > > > >>> > > > > [ConsumerFetcherThread-myconsumergroup-1387353494862-7aa0c61d-0-45],
> > > > > > > > > >>> > > > > Current offset 204243601 for partition [mytopic,0] out of
> > > > > > > > > >>> > > > > range; reset offset to 204343397
> > > > > > > > > >>> > > > >
> > > > > > > > > >>> > > > > But it would be nice to be able to also use the broker log
> > > > > > > > > >>> > > > > to quickly find consumers with issues.
> > > > > > > > > >>> > > > >
> > > > > > > > > >>> > > > > Also, I'm not sure why the event is logged as an ERROR in
> > > > > > > > > >>> > > > > the broker, but a WARN in the consumer?
> > > > > > > > > >>> > > > >
> > > > > > > > > >>> > > > > Jason
> > > > > > > > > >>> > > > >
> > > > > > > > > >>> > > > --
> > > > > > > > > >>> > > > -- Guozhang
> > > > > > > > > >> --
> > > > > > > > > >> -- Guozhang