https://bugzilla.wikimedia.org/show_bug.cgi?id=71056
Bug ID: 71056
Summary: kafkatee not consuming for some partitions
Product: Analytics
Version: unspecified
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: Unprioritized
Component: General/Unknown
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected],
[email protected], [email protected],
[email protected], [email protected]
Web browser: ---
Mobile Platform: ---
Below is a paste of an email I sent to Christian and Magnus in an effort to
describe this problem. I will add a comment with further insights.
---
Christian and I are ramping up to use kafkatee more in production. We’re
generating more output using it, and are comparing the output against
udp2log’s.
We recently noticed that for one of our topics (webrequest_upload), there was
8.3% more data in the udp2log output. We currently have 12 partitions, and
1/12 is 8.3%. This made us suspect that one of the partitions was not being
consumed properly. We investigated that, and now have lots of weirdness to
report. Apologies if this is disorganized; there are a lot of moving pieces
here. I’ll try to report as best I can.
= Problem 1. Some offset files are not updating. =
I restarted kafkatee shortly before running the following command:
$ date && ls -ltr /var/cache/kafkatee/offsets/
Wed Sep 17 22:05:31 UTC 2014
total 192
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-9.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-8.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-7.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-6.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-5.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-3.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-4.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-2.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-1.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Jun 11 20:28 webrequest_mobile-0.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 21:32 webrequest_bits-10.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 21:32 webrequest_bits-9.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 21:38 webrequest_upload-10.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 21:47 webrequest_text-10.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Sep 17 21:48 webrequest_mobile-10.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-4.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-2.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-6.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-8.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-0.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-4.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-0.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-6.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-2.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-7.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-3.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-8.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-4.offset
-rw-r--r-- 1 kafkatee kafkatee 11 Sep 17 22:02 webrequest_mobile-11.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-0.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-3.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-11.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-8.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-6.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-2.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-9.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-1.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-1.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_text-5.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-7.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-5.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_bits-11.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-1.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-5.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-9.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-3.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-11.offset
-rw-r--r-- 1 kafkatee kafkatee 12 Sep 17 22:02 webrequest_upload-7.offset
You can see here that some of the offset files have older mtimes. There seem
to be two different cases where the files are not updated.
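A quick way to spot the stale files is `find` with `-mmin`. This is just a sketch using a temporary fixture directory and a 10-minute threshold (both are my assumptions); on the real host you'd point it at /var/cache/kafkatee/offsets directly:

```shell
# Sketch: list offset files whose mtime is stale. The 10-minute
# threshold and the fixture directory are assumptions; on the real
# host the path would be /var/cache/kafkatee/offsets.
dir=$(mktemp -d)
echo 12345 > "$dir/webrequest_upload-0.offset"
echo 67890 > "$dir/webrequest_upload-11.offset"
# Simulate a stale file: push its mtime 30 minutes into the past.
touch -d '30 minutes ago' "$dir/webrequest_upload-11.offset"
# The actual check: offset files not modified in the last 10 minutes.
stale=$(find "$dir" -name '*.offset' -mmin +10)
echo "$stale"
rm -rf "$dir"
```

Running something like this periodically would flag the stuck partitions without eyeballing `ls -ltr` output.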
== Case 1a: Very old offsets for mobile files, but still consuming ==
10 out of 12 of the mobile offset files have not been updated since June. I
turned on verbose kafkatee logging and tried to find some relevant output
during kafkatee startup:
%7|1410991470.795|OFFSET|kafkatee#consumer-0| webrequest_mobile [2] using
offset file /var/cache/kafkatee/offsets/webrequest_mobile-2.offset
%7|1410991470.795|OFFSET|kafkatee#consumer-0| webrequest_mobile [2]: Read
offset 7486778007 from offset file
(/var/cache/kafkatee/offsets/webrequest_mobile-2.offset)
%7|1410991471.224|FETCH|kafkatee#consumer-0| analytics1021.eqiad.wmnet:9092/21:
Fetch topic webrequest_mobile [2] at offset 7486778007
%7|1410991471.429|OFFSET|kafkatee#consumer-0| webrequest_mobile [2]: offset
reset (at offset 7486778007) to -1: Fetch response: Broker: Offset out of range
%7|1410991472.537|OFFSET|kafkatee#consumer-0|
analytics1021.eqiad.wmnet:9092/21: OffsetRequest (-1) for topic
webrequest_mobile [2]
%7|1410991472.537|FETCH|kafkatee#consumer-0| analytics1021.eqiad.wmnet:9092/21:
Skipping topic webrequest_mobile [2] in state offset-wait
%7|1410991472.544|OFFSET|kafkatee#consumer-0|
analytics1021.eqiad.wmnet:9092/21: OffsetReply for topic webrequest_mobile [2]:
offset 3759198618: activating fetch
%7|1410991472.544|OFFSET|kafkatee#consumer-0|
analytics1021.eqiad.wmnet:9092/21: OffsetReply for topic webrequest_mobile [2]:
Success
%7|1410991473.254|FETCH|kafkatee#consumer-0| analytics1021.eqiad.wmnet:9092/21:
Fetch topic webrequest_mobile [2] at offset 3759199122
%7|1410991474.441|FETCH|kafkatee#consumer-0| analytics1021.eqiad.wmnet:9092/21:
Fetch topic webrequest_mobile [2] at offset 3759199454
…
Though those are probably not interesting, since the old offset is clearly no
longer relevant. Hm…
But, even though these mobile partitions are not updating their offset files,
they are consuming and updating next_offset in kafkatee.stats.json just fine.
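That file-vs-stats comparison can be scripted. Here is a sketch; the fixture data and paths below are made up (on the real host the offsets live in /var/cache/kafkatee/offsets and the stats in kafkatee.stats.json), and it assumes `jq` is available:

```shell
# Sketch: compare each partition's on-disk offset file against the
# next_offset kafkatee reports in its stats JSON. Fixture data and
# paths here are assumptions, not the real host's.
dir=$(mktemp -d)
echo 3759199454 > "$dir/webrequest_mobile-2.offset"
cat > "$dir/kafkatee.stats.json" <<'EOF'
{"kafka":{"topics":{"webrequest_mobile":{"partitions":{"2":{"next_offset":3759200000}}}}}}
EOF

for f in "$dir"/webrequest_mobile-*.offset; do
  p=${f##*-}; p=${p%.offset}    # partition number from the filename
  disk=$(cat "$f")              # offset last committed to disk
  next=$(tail -n 1 "$dir/kafkatee.stats.json" \
    | jq ".kafka.topics.webrequest_mobile.partitions[\"$p\"].next_offset")
  summary="partition $p: offset file=$disk next_offset=$next"
  echo "$summary"
done
rm -rf "$dir"
```

A partition whose next_offset keeps advancing while its offset file stays put would confirm "consuming, but not committing".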
== Case 1b: Not so old offsets, but not consuming(?) ==
During runtime, kafkatee stops updating a few of the offset files. This might
not be bad in itself, except that we see missing data proportional to the
number of offset files that are not updating. E.g. if 1 out of 12 of the
upload offset files is not updating, then we miss 1/12th of the upload output
data.
When this happens, kafkatee.stats.json fails to report stats for the partitions
that ARE updating. As I write this email, I see that the
webrequest_upload-11.offset file has not been updated in 10 minutes. If I look
at the partition statistics for webrequest_upload, I only see information for a
single (real) partition, partition 11:
$ tail -n 1 kafkatee.stats.json | jq '.kafka.topics.webrequest_upload.partitions'
{
"-1": {
"txbytes": 0,
"xmit_msgq_bytes": 0,
"xmit_msgq_cnt": 0,
"msgq_bytes": 0,
"msgq_cnt": 0,
"unknown": false,
"desired": false,
"leader": -1,
"partition": -1,
"fetchq_cnt": 0,
"fetch_state": "none",
"query_offset": 0,
"next_offset": 0,
"app_offset": 0,
"commited_offset": 0,
"eof_offset": 0,
"txmsgs": 0
},
"11": {
"txbytes": 0,
"xmit_msgq_bytes": 0,
"xmit_msgq_cnt": 0,
"msgq_bytes": 0,
"msgq_cnt": 0,
"unknown": true,
"desired": true,
"leader": -1,
"partition": 11,
"fetchq_cnt": 0,
"fetch_state": "active",
"query_offset": 0,
"next_offset": 31744699547,
"app_offset": 0,
"commited_offset": 31744699547,
"eof_offset": 0,
"txmsgs": 0
}
}
It isn’t consuming anything! Curious that the leader is -1, eh? What does
that mean?
= Problem 2. kafkatee.stats.json not reporting all partition stats =
Welp, I guess I just described this problem above.
$ for topic in mobile bits text upload; do
    echo -n "reported $topic partitions: "
    tail -n 1 kafkatee.stats.json \
      | jq ".kafka.topics.webrequest_$topic.partitions" \
      | grep '"fetch_state": "active"' | wc -l
  done
reported mobile partitions: 1
reported bits partitions: 12
reported text partitions: 12
reported upload partitions: 1
Both mobile and upload have a single offset file that hasn’t been updated in
many minutes (excluding those mobile files that haven’t been updated since
June). For mobile and upload, kafkatee.stats.json only reports partition
statistics for those two stale partitions (one per topic). It does not report
any statistics about the partitions that are actively updating their offset
files.
= Problem 3: Some partitions are not writing at all =
This has also been described, but is a really big problem! As far as we can
tell, the 2 weirdo partitions that won’t update their offset files do not
actually consume (or output?) any data at all.
Which partitions go weirdo after a kafkatee restart is not predictable, but we
have not yet seen this happen to any webrequest_text partitions. Maybe that’s
just a fluke.
Hm, after writing all of the above, it seems there is some bug where some
partitions can't consume, which causes kafkatee.stats.json to get all wonky.
Phew, I’m still confused about all of these moving parts. I’ll re-read this
tomorrow and see if it makes more sense. Magnus, if you are around tomorrow,
let’s chat on IRC about this, eh?
Thanks!
_Ao