Another interesting datapoint:

Taking a deeper look at partition 21:

brann@kafka1:/data/kafka/logs/__consumer_offsets-21$ ls -la
total 20176
drwxr-xr-x    2 kafka kafka     4096 Dec 15 08:11 .
drwxr-xr-x 1605 kafka kafka   167936 Dec 15 08:31 ..
-rw-r--r--    1 kafka kafka        0 Dec 15 08:03 00000000000000000000.index
-rw-r--r--    1 kafka kafka    20924 Dec 15 04:08 00000000000000000000.log
-rw-r--r--    1 kafka kafka       12 Dec 15 08:03 00000000000000000000.timeindex
-rw-r--r--    1 kafka kafka        0 Dec 15 08:03 00000000000200248390.index
-rw-r--r--    1 kafka kafka     3662 Dec 15 07:53 00000000000200248390.log
-rw-r--r--    1 kafka kafka       10 Dec 15 04:08 00000000000200248390.snapshot
-rw-r--r--    1 kafka kafka       12 Dec 15 08:03 00000000000200248390.timeindex
-rw-r--r--    1 kafka kafka       10 Dec 15 04:36 00000000000200349263.snapshot
-rw-r--r--    1 kafka kafka       10 Dec 15 06:01 00000000000200650903.snapshot
-rw-r--r--    1 kafka kafka 10485760 Dec 15 08:31 00000000000201167309.index
-rw-r--r--    1 kafka kafka 15874568 Dec 15 08:31 00000000000201167309.log
-rw-r--r--    1 kafka kafka       10 Dec 15 07:53 00000000000201167309.snapshot
-rw-r--r--    1 kafka kafka 10485756 Dec 15 08:31 00000000000201167309.timeindex
-rw-r--r--    1 kafka kafka      164 Dec 15 08:11 leader-epoch-checkpoint

*Why are there extra .snapshot files in there?*

These two are from before and after the offset error mentioned below:

-rw-r--r--    1 kafka kafka       10 Dec 15 04:36 00000000000200349263.snapshot
-rw-r--r--    1 kafka kafka       10 Dec 15 06:01 00000000000200650903.snapshot


Taking a deeper look into the log file:

__consumer_offsets-21$ /data/kafka/kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --deep-iteration --files 00000000000200248390.log | less

Dumping 00000000000200248390.log
Starting offset: 200248390
offset: 201167253 position: 0 CreateTime: 1513324424113 isvalid: true
keysize: 75 valuesize: 28 magic: 2 compresscodec: NONE producerId: -1
producerEpoch: -1 sequence: 1 isTransactional: false headerKeys: []
offset: 201167258 position: 0 CreateTime: 1513324424113 isvalid: true
keysize: 75 valuesize: 28 magic: 2 compresscodec: NONE producerId: -1
producerEpoch: -1 sequence: 6 isTransactional: false headerKeys: []
...
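
(Side note: DumpLogSegments can also decode the offsets topic's keys/values with
--offsets-decoder, which makes it easier to see which group/topic/partition each
record belongs to. A sketch, same segment as above:)

__consumer_offsets-21$ /data/kafka/kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --deep-iteration --offsets-decoder --files 00000000000200248390.log | less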

Interestingly, for partition 21 the error is about offset 200349244. I'd
expect that to be in 00000000000200248390.log, but when dumping that log
file manually it starts with offset 201167266.
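
For reference, here's roughly how I'm mapping an offset to the segment that
should contain it (the segment with the largest base offset <= the target);
the offset below is the one from the error, and this is only a sketch:

# Sketch: segment file names are zero-padded base offsets, so ls already lists
# them in offset order; keep the last base offset that is <= the error offset.
ls *.log | sed 's/\.log$//' \
  | awk -v off=200349244 '$1 + 0 <= off { seg = $1 } END { print seg ".log" }'

Given the listing above, that prints 00000000000200248390.log, hence the
expectation.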


Still stumped on:
Why is it erroring about offset 200349244?
Why can't I find 200349244 in the log file for that partition? (I've grepped
the output of DumpLogSegments for all 3 log files, roughly as sketched below;
it's not there.)
Why is it trying 200349244? Is it coming from the snapshot files? Is there some
surgery we can do to make it stop, safely? :)
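
The grep was roughly along these lines (a sketch; the offset is the one from
the fetch errors):

# Sketch: dump each segment for partition 21 and count hits for the error offset.
for f in *.log; do
  echo "== $f"
  /data/kafka/kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
    --deep-iteration --files "$f" | grep -c 'offset: 200349244 '
done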

On Fri, Dec 15, 2017 at 4:33 PM, Brett Rann <br...@zendesk.com> wrote:

> On `kafka_2.11-1.0.1-d04daf570` we are upgrading the log format from
> 0.9.0.1 to 0.11.0.1, and after the upgrade have set:
>
> inter.broker.protocol.version=1.0
> log.message.format.version=0.11.0.1
>
> We have applied this upgrade to 5 clusters by upgrading broker 1, leaving
> it for a day, then coming back when happy to upgrade the remaining brokers.
>
>
> 4 of those upgrades went without issue.
>
>
> However, in one of them, after we upgraded the remaining brokers, we started
> seeing these errors on broker 1:
>
> Error processing fetch operation on partition __consumer_offsets-21  offset 
> 200349244
>
> For 4 consumer offset partitions, all of which happen to be led by broker 1.
>
> kafka-request-handler-3 72 ERROR kafka.server.ReplicaManager 
> 2017-12-15T07:39:40.380+0000 [ReplicaManager broker=1] Error processing fetch 
> operation on partition __consumer_offsets-21  offset 200349244
> kafka-request-handler-3 72 ERROR kafka.server.ReplicaManager 
> 2017-12-15T07:39:40.381+0000 [ReplicaManager broker=1] Error processing fetch 
> operation on partition __consumer_offsets-11  offset 188709568
> kafka-request-handler-3 72 ERROR kafka.server.ReplicaManager 
> 2017-12-15T07:39:40.381+0000 [ReplicaManager broker=1] Error processing fetch 
> operation on partition __consumer_offsets-1  offset 2045483676
> kafka-request-handler-5 74 ERROR kafka.server.ReplicaManager 
> 2017-12-15T07:39:41.672+0000 [ReplicaManager broker=1] Error processing fetch 
> operation on partition __consumer_offsets-31  offset 235294887
>
> Every second or so.
>
> If we stop that broker, those errors simply shift to the next leader for
> those 4 partitions. And moving the partitions to completely new brokers
> just moves the errors.
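>
> (For reference, a move like that looks roughly like the sketch below; the
> ZooKeeper host and target replica list here are placeholders, not the exact
> ones we used:)
>
> # Sketch: reassign __consumer_offsets-21 to a new replica set with
> # kafka-reassign-partitions.sh (placeholders: zk1:2181 and replicas 4,5,6).
> cat > reassign-offsets-21.json <<'EOF'
> {"version":1,"partitions":[
>   {"topic":"__consumer_offsets","partition":21,"replicas":[4,5,6]}
> ]}
> EOF
> /data/kafka/kafka/bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
>   --reassignment-json-file reassign-offsets-21.json --execute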
>
> We only see this on kafka1, not on the other 9 brokers, which had the log
> message format upgraded a day or two later.
>
> Any suggestions on how to proceed? I'm not even sure yet whether this is
> isolated to this cluster, or whether it's related to a misbehaving consumer.
> Since our multiple clusters /should/ have the same set of
> producers/consumers working on them, I'm doubtful that it's a misbehaving
> client.
>
