Do you think you can reproduce this easily? Thanks,
Jun

On Fri, Jan 10, 2014 at 11:33 AM, Jason Rosenberg <j...@squareup.com> wrote:

Well, not currently, as we don't have multiple partitions for the topics... but yes, I understand that would help too.

We are using this multiple-consumers-within-a-process approach in general with much success so far; I was just curious about this ERROR I was seeing :)

On Fri, Jan 10, 2014 at 11:06 AM, Jun Rao <jun...@gmail.com> wrote:

Could you increase parallelism on the consumers?

Thanks,

Jun

On Thu, Jan 9, 2014 at 1:22 PM, Jason Rosenberg <j...@squareup.com> wrote:

The consumption rate is a little better after the refactoring. The main issue, though, was that we had a mismatch between large and small topics. A large topic can lag and adversely affect consumption of other topics, so this is an attempt to isolate topic filtering and better balance the consumers for the different topics.

So, it's definitely working on that score.

The topic that was lagging (and getting OffsetOutOfRangeExceptions) was doing that before and after the refactor (and after we started also seeing the ERROR logging). But consumption of all other topics is working better now (almost no lag at all).

I'm also setting the client.id for each consumer in the process, so that I can see the individual metrics per consumer.

Jason

On Thu, Jan 9, 2014 at 1:00 PM, Jun Rao <jun...@gmail.com> wrote:

Does the consumption rate in the client (msg/sec) change significantly after the refactoring?

Thanks,

Jun

On Wed, Jan 8, 2014 at 10:44 AM, Jason Rosenberg <j...@squareup.com> wrote:

Yes, it's happening continuously at the moment (although I'm expecting the consumer to catch up soon).

It seemed to start happening after I refactored the consumer app to use multiple consumer connectors in the same process (each one has a separate topic filter, so there should be no overlap between them). All of them use the same consumer group.

Could it be a thread-safety issue in the ZookeeperConsumerConnector? (Seems unlikely.)

Jason

On Wed, Jan 8, 2014 at 1:04 AM, Jun Rao <jun...@gmail.com> wrote:

Normally, if the consumer can't keep up, you should just see the OffsetOutOfRangeException warning. The offset mismatch error should never happen. It could be that the OffsetOutOfRangeException exposed a bug. Do you think you can reproduce this easily?

Thanks,

Jun

On Tue, Jan 7, 2014 at 9:29 PM, Jason Rosenberg <j...@squareup.com> wrote:

Jun,

I'm not sure I understand your question with regard to the produced data?

But yes, in general, I believe the consumer is not keeping up with the broker's deletion of the data. So it's trying to fetch the next batch of data, but its last offset is no longer there, etc. That's the reason for the WARN message in the fetcher thread.

I'm just not sure I understand, then, why we don't always see the ConsumerIterator error as well, because won't there always be missing data detected there? Why sometimes and not always? What's the difference?

Jason
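(For illustration: the setup Jason describes above, i.e. several high-level consumer connectors in one process, all in the same consumer group, each with its own topic filter and its own client.id for per-consumer metrics, might look roughly like the following sketch against the 0.8 Java consumer API. The ZooKeeper address, group name, client ids, topic regexes, and stream counts are placeholders, not values from this thread.)

    import java.util.List;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.consumer.Whitelist;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class MultiConnectorSketch {

      // One connector per topic filter, all in the same group; a distinct
      // client.id per connector keeps their metrics distinguishable.
      static ConsumerConnector createConnector(String clientId) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");  // placeholder
        props.put("group.id", "myconsumergroup");          // same group for every connector
        props.put("client.id", clientId);                  // distinct per connector
        // Where the fetcher resets to on OffsetOutOfRangeException:
        // "largest" skips to the newest data, "smallest" to the oldest still retained.
        props.put("auto.offset.reset", "largest");
        return Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
      }

      static void consume(ConsumerConnector connector, String topicRegex, int numStreams) {
        // Streams beyond the partition count sit idle, hence Jun's suggestion
        // to add partitions before adding consumer parallelism.
        List<KafkaStream<byte[], byte[]>> streams =
            connector.createMessageStreamsByFilter(new Whitelist(topicRegex), numStreams);
        for (final KafkaStream<byte[], byte[]> stream : streams) {
          new Thread(new Runnable() {
            public void run() {
              ConsumerIterator<byte[], byte[]> it = stream.iterator();
              while (it.hasNext()) {
                byte[] payload = it.next().message();
                // process payload ...
              }
            }
          }).start();
        }
      }

      public static void main(String[] args) {
        // Isolate the large, lag-prone topic from the small ones.
        consume(createConnector("myconsumerapp-large"), "mytopic", 1);
        consume(createConnector("myconsumerapp-small"), "othertopic.*", 1);
      }
    }

(Whether several connectors in one process can safely share a group is exactly the open question in the thread, so treat this as a sketch of the layout under discussion, not a recommendation.)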
On Wed, Jan 8, 2014 at 12:07 AM, Jun Rao <jun...@gmail.com> wrote:

The WARN and ERROR may not be completely correlated. Could it be that the consumer is slow and couldn't keep up with the produced data?

Thanks,

Jun

On Tue, Jan 7, 2014 at 6:47 PM, Jason Rosenberg <j...@squareup.com> wrote:

So, sometimes I just get the WARN from the ConsumerFetcherThread (as previously noted above), e.g.:

2014-01-08 02:31:47,394 WARN [ConsumerFetcherThread-myconsumerapp-11] consumer.ConsumerFetcherThread - [ConsumerFetcherThread-myconsumerapp-11], Current offset 16163904970 for partition [mypartition,0] out of range; reset offset to 16175326044

More recently, I also see the following log line (not sure why I didn't see it previously), coming from the ConsumerIterator:

2014-01-08 02:31:47,681 ERROR [myconsumerthread-0] consumer.ConsumerIterator - consumed offset: 16163904970 doesn't match fetch offset: 16175326044 for mytopic:0: fetched offset = 16175330598: consumed offset = 16163904970; Consumer may lose data

Why would I not see this second ERROR every time there's a corresponding WARN on the FetcherThread for an offset reset?

Should I only be concerned about possible lost data if I see the second ERROR log line?

Jason

On Tue, Dec 24, 2013 at 3:49 PM, Jason Rosenberg <j...@squareup.com> wrote:

But I assume this is not something you'd normally want to log (every incoming producer request?). Maybe just for debugging? Or is it only for consumer fetch requests?

On Tue, Dec 24, 2013 at 12:50 PM, Guozhang Wang <wangg...@gmail.com> wrote:

TRACE is lower than INFO, so with the request logger turned up to TRACE, the INFO-level logging would also still be recorded.

You can check for "Completed XXX request" in the log files to check the request info with the correlation id.

Guozhang
On Mon, Dec 23, 2013 at 10:46 PM, Jason Rosenberg <j...@squareup.com> wrote:

Hmmm, it looks like I'm enabling all logging at INFO, and the request logging is only done at TRACE (why is that?).

I suppose one wouldn't normally want to see request logs, so by default they aren't enabled?

On Mon, Dec 23, 2013 at 11:46 PM, Jun Rao <jun...@gmail.com> wrote:

Did you enable the request log? It logs the IP of every request.

Thanks,

Jun

On Mon, Dec 23, 2013 at 3:52 PM, Jason Rosenberg <j...@squareup.com> wrote:

Hi Guozhang,

I'm not sure I understand your first answer. I don't see anything regarding the correlation id elsewhere in the broker logs; it only shows up in those ERROR messages.

I do see correlation ids in clients, but not on the broker.

Jason

On Mon, Dec 23, 2013 at 6:46 PM, Guozhang Wang <wangg...@gmail.com> wrote:

Jason,

You can search for the correlation id in the public access log on the servers to get the consumer information.

As for logging, I agree that we should use the same level on both sides. Could you file a jira for this?

Guozhang
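(On the request-log exchange above: in the log4j.properties that ships with Kafka 0.8, the request logger is already wired to a separate kafka-request.log appender but is left at WARN, which is why nothing shows up under a plain INFO configuration. Raising that one logger to TRACE is what produces the per-request "Completed ... request" lines carrying the correlation id and the client address. A sketch, assuming the appender wiring from the shipped config; adjust the names and paths to your own layout:)

    log4j.appender.requestAppender=org.apache.log4j.DailyRollingFileAppender
    log4j.appender.requestAppender.DatePattern='.'yyyy-MM-dd-HH
    log4j.appender.requestAppender.File=${kafka.logs.dir}/kafka-request.log
    log4j.appender.requestAppender.layout=org.apache.log4j.PatternLayout
    log4j.appender.requestAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

    # WARN in the shipped file; TRACE turns on per-request logging
    log4j.logger.kafka.request.logger=TRACE, requestAppender
    log4j.additivity.kafka.request.logger=false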
On Mon, Dec 23, 2013 at 3:09 PM, Jason Rosenberg <j...@squareup.com> wrote:

In our broker logs, we occasionally see errors like this:

2013-12-23 05:02:08,456 ERROR [kafka-request-handler-2] server.KafkaApis - [KafkaApi-45] Error when processing fetch request for partition [mytopic,0] offset 204243601 from consumer with correlation id 130341
kafka.common.OffsetOutOfRangeException: Request for offset 204243601 but we only have log segments in the range 204343397 to 207423640.

I assume this means there's a consumer that has fallen behind consuming messages, and the log retention policy has removed messages before they could be consumed by the consumer.

However, I'm not 100% sure which consumer it is, and it looks like the only info we have is the correlation id of the consumer, e.g.:

"from consumer with correlation id 130341"

Is there a way to know which consumer this refers to? It seems there are far more correlation ids than there are consumers. Would it be possible to provide a more descriptive error message here, so we can immediately know which consumer is falling behind?
We do see a corresponding entry in the consumer logs too:

2013-12-23 05:02:08,797 WARN [ConsumerFetcherThread-myconsumergroup-1387353494862-7aa0c61d-0-45] consumer.ConsumerFetcherThread - [ConsumerFetcherThread-myconsumergroup-1387353494862-7aa0c61d-0-45], Current offset 204243601 for partition [mytopic,0] out of range; reset offset to 204343397

But it would be nice to be able to also use the broker log to quickly find consumers with issues.

Also, I'm not sure why this event is logged as an ERROR in the broker but a WARN in the consumer?

Jason
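(A closing aside on "quickly find consumers with issues": for a ZooKeeper-based consumer group like this one, the offset checker tool bundled with 0.8 prints, per partition, the group's committed offset, the log end offset, the lag, and the owning consumer thread, which identifies the instance that is falling behind. The group name and ZooKeeper address below are placeholders:)

    bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181 --group myconsumergroup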