https://bugzilla.wikimedia.org/show_bug.cgi?id=69244
Bug ID: 69244
Summary: Kafka broker analytics1021 having issues on 2014-08-06 ~1:44
Product: Analytics
Version: unspecified
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: Unprioritized
Component: General/Unknown
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected],
[email protected], [email protected],
[email protected], [email protected]
Web browser: ---
Mobile Platform: ---
Created attachment 16152
--> https://bugzilla.wikimedia.org/attachment.cgi?id=16152&action=edit
analytics1021-AllTopicsMessagesInPerSec-OneMinuteRate
From http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140807.txt
[13:45:20] <mutante> analytics1021:
[13:45:22] <mutante> 3/3
kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate
CRITICAL: 7.42708492353e-59
[13:54:36] <tnegrin> gage?
[14:01:03] <tnegrin> mutante: andrew is out today -- is that alert
repeating?
[14:01:50] <mutante> tnegrin: yes, it started a little over 1 day ago
[14:02:05] <tnegrin> hmm -- the graphs I look at all look normal
[14:02:06] <mutante> at wikimania but not sure how criticial it is
[14:03:06] <tnegrin> SF comes online in a few hours -- can you sleep it for
2 hours?
[14:03:12] <tnegrin> I will have gage look at it
[14:03:24] <tnegrin> (I don't think it's critical)
[14:04:14] <mutante> yes, i can
[14:04:18] <mutante> ok, thanks
[14:04:35] <tnegrin> thank
[14:04:37] <tnegrin> thanks
Ganglia shows the incoming message rate on analytics1021 dropping, with
the other brokers taking over.
(See attachments
analytics1021-AllTopicsMessagesInPerSec-OneMinuteRate.png
Cluster-MessagesInPerSec-OneMinuteRate.png
Cluster-RequestsPerSec-OneMinuteRate.png
)
It seems to have happened around 2014-08-06 01:44.
At that time, according to /var/log/kafka/kafka.log on analytics1021, the
ZooKeeper session expired [1]:
[...]
[2014-08-06 01:44:36,974] 101327050 [main-EventThread] INFO
org.I0Itec.zkclient.ZkClient - zookeeper state changed (Expired)
[...]
and the broker could not reconnect to ZooKeeper:
[...]
[2014-08-06 01:44:37,061] 101327137
[main-SendThread(analytics1024.eqiad.wmnet:2181)] INFO
org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service,
session 0x146fd72a83d0dbe has expired, closing socket connection
[...]
Then, after reconnecting, a controller re-election took place:
[2014-08-06 01:44:37,215] 101327291
[ZkClient-EventThread-14-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad]
INFO kafka.controller.KafkaController$SessionExpirationListener -
[SessionExpirationListener on 21], ZK expired; shut down all controller
components and try to re-elect
[2014-08-06 01:44:37,272] 101327348
[ZkClient-EventThread-14-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad]
INFO kafka.utils.ZkUtils$ - conflict in /controller data:
{"version":1,"brokerid":21,"timestamp":"1407289477248"} stored data:
{"version":1,"brokerid":22,"timestamp":"1407187809296"}
[1] Typically the state changes between Disconnected and SyncConnected, with
only a few hundred ms spent in the Disconnected state
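To check how long the broker usually stays in the Disconnected state, the
ZkClient state changes can be pulled out of kafka.log. The following is only
a sketch: it assumes each log entry sits on a single line in the file (the
excerpts above are wrapped by the mail client), and the function names are
hypothetical.

```python
import re
from datetime import datetime

# Matches e.g. "[2014-08-06 01:44:36,974] ... zookeeper state changed (Expired)"
STATE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\].*"
    r"zookeeper state changed \((?P<state>\w+)\)"
)


def zk_state_changes(lines):
    """Return a list of (timestamp, state) for every ZkClient state change."""
    changes = []
    for line in lines:
        match = STATE_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            changes.append((ts, match.group("state")))
    return changes


def disconnected_durations(changes):
    """Return (start, next_state, seconds) for each span spent Disconnected."""
    spans = []
    for (t0, s0), (t1, s1) in zip(changes, changes[1:]):
        if s0 == "Disconnected":
            spans.append((t0, s1, (t1 - t0).total_seconds()))
    return spans


def report(path="/var/log/kafka/kafka.log"):
    """Print every Disconnected span found in the given log file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for start, next_state, seconds in disconnected_durations(
            zk_state_changes(handle)
        ):
            print(f"{start}  Disconnected for {seconds:.3f}s -> {next_state}")
```

Calling report() on analytics1021 would list each Disconnected span; spans
ending in Expired rather than SyncConnected are the ones relevant to this
bug.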