https://bugzilla.wikimedia.org/show_bug.cgi?id=71056

--- Comment #1 from Andrew Otto <[email protected]> ---
Magnus and I worked to try to figure out what was going on.  We have upgraded
librdkafka to 0.8.4 on analytics1003 (and also attempted to use broker offset
storage).

By including only the webrequest_upload topic as input to kafkatee, I was able
to reproduce this problem in a simplified setting.  Evidence points to
analytics1021 as being the cause of this problem (again).

These bugs are likely related:
  https://github.com/edenhill/librdkafka/issues/147
  https://issues.apache.org/jira/browse/KAFKA-1367

I restarted analytics1021 and issued a preferred-replica-election.  This did
not solve the broker/zookeeper metadata mismatch (described in those bugs), but
it did solve the problem of kafkatee not consuming from all partitions.
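
To make the mismatch concrete, here is a rough sketch of how one could compare
the two views of partition metadata once they have been collected.  It does
not talk to Kafka or ZooKeeper at all; the function name, the dict layout and
the broker ids (12, 21, 22) are made up for illustration, and fetching the
real metadata is left out.

```python
# Hypothetical helper: compares two already-collected views of partition
# metadata. Each view maps partition id -> {"leader": int, "isr": [int, ...]}.
def find_metadata_mismatches(broker_view, zookeeper_view):
    """Return the partitions whose leader or ISR differ between the views."""
    mismatches = {}
    for partition in sorted(set(broker_view) | set(zookeeper_view)):
        b = broker_view.get(partition)
        z = zookeeper_view.get(partition)
        if b is None or z is None:
            # Partition is known to only one side.
            mismatches[partition] = ("missing", b, z)
            continue
        diffs = []
        if b["leader"] != z["leader"]:
            diffs.append("leader")
        # ISR order is not significant, so compare as sets.
        if set(b["isr"]) != set(z["isr"]):
            diffs.append("isr")
        if diffs:
            mismatches[partition] = (",".join(diffs), b, z)
    return mismatches


# Made-up sample data, not taken from the real cluster.
broker_view = {
    0: {"leader": 21, "isr": [21, 22]},
    1: {"leader": 12, "isr": [12, 21]},
}
zookeeper_view = {
    0: {"leader": 21, "isr": [22]},
    1: {"leader": 12, "isr": [21, 12]},
}

for partition, (kind, b, z) in find_metadata_mismatches(
        broker_view, zookeeper_view).items():
    print(partition, kind, "broker:", b, "zookeeper:", z)
```

With the sample data only partition 0 is reported, since its ISR differs
between the two views while partition 1 differs only in ISR ordering.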

I'm not entirely sure how to move on from here.  I'm going to re-add kafkatee
consuming from all topics, and see how things go over the weekend.  I wonder if
this has something to do with a metadata refresh bug in librdkafka/kafka + the
weird analytics1021 kafka<->zookeeper timeout bug[1] that we have been
struggling with.

If these issues persist, I think we should consider dropping analytics1021 from
our Kafka cluster.  It's hard to say if we have problems because of this
machine, or because of the rack/network it is in, or because of a fluke.


[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=69667

