https://bugzilla.wikimedia.org/show_bug.cgi?id=69244

            Bug ID: 69244
           Summary: Kafka broker analytics1021 having issues on 2014-08-06
                    ~1:44
           Product: Analytics
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: Unprioritized
         Component: General/Unknown
          Assignee: [email protected]
          Reporter: [email protected]
                CC: [email protected],
                    [email protected], [email protected],
                    [email protected], [email protected]
       Web browser: ---
   Mobile Platform: ---

Created attachment 16152
  --> https://bugzilla.wikimedia.org/attachment.cgi?id=16152&action=edit
analytics1021-AllTopicsMessagesInPerSec-OneMinuteRate

From http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140807.txt

[13:45:20] <mutante>     analytics1021:
[13:45:22] <mutante>     3/3 kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 7.42708492353e-59
[13:54:36] <tnegrin>     gage?
[14:01:03] <tnegrin>     mutante: andrew is out today -- is that alert repeating?
[14:01:50] <mutante>     tnegrin: yes, it started a little over 1 day ago
[14:02:05] <tnegrin>     hmm -- the graphs I look at all look normal
[14:02:06] <mutante>     at wikimania but not sure how criticial it is
[14:03:06] <tnegrin>     SF comes online in a few hours -- can you sleep it for 2 hours?
[14:03:12] <tnegrin>     I will have gage look at it
[14:03:24] <tnegrin>     (I don't think it's critical)
[14:04:14] <mutante>     yes, i can
[14:04:18] <mutante>     ok, thanks
[14:04:35] <tnegrin>     thank
[14:04:37] <tnegrin>     thanks

Ganglia shows the incoming message rate on analytics1021 dropping, with
the other brokers taking over.

(See attachments
  analytics1021-AllTopicsMessagesInPerSec-OneMinuteRate.png
  Cluster-MessagesInPerSec-OneMinuteRate.png
  Cluster-RequestsPerSec-OneMinuteRate.png
)

It seems to have happened around 2014-08-06 01:44.

At that time, according to /var/log/kafka/kafka.log on analytics1021, the
ZooKeeper session expired [1]:

  [...]
  [2014-08-06 01:44:36,974] 101327050 [main-EventThread] INFO 
org.I0Itec.zkclient.ZkClient  - zookeeper state changed (Expired)
  [...]

and the client could not reconnect to ZooKeeper with that session:

  [...]
  [2014-08-06 01:44:37,061] 101327137
[main-SendThread(analytics1024.eqiad.wmnet:2181)] INFO 
org.apache.zookeeper.ClientCnxn  - Unable to reconnect to ZooKeeper service,
session 0x146fd72a83d0dbe has expired, closing socket connection
  [...]

Then, after reconnecting, the broker attempted a controller re-election
and hit a conflict in /controller:

[2014-08-06 01:44:37,215] 101327291
[ZkClient-EventThread-14-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad]
INFO  kafka.controller.KafkaController$SessionExpirationListener  -
[SessionExpirationListener on 21], ZK expired; shut down all controller
components and try to re-elect
[2014-08-06 01:44:37,272] 101327348
[ZkClient-EventThread-14-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad]
INFO  kafka.utils.ZkUtils$  - conflict in /controller data:
{"version":1,"brokerid":21,"timestamp":"1407289477248"} stored data:
{"version":1,"brokerid":22,"timestamp":"1407187809296"}


[1] Typically the state changes between Disconnected and SyncConnected,
with only a few hundred ms spent in the Disconnected state
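
To compare this incident with the usual Disconnected/SyncConnected
flapping described in [1], one could list the ZkClient state changes from
kafka.log. A rough Python sketch, assuming each log entry sits on a
single line in the format quoted above (the function name
zk_state_changes is hypothetical):

```python
import re
from datetime import datetime

# Matches e.g.
# [2014-08-06 01:44:36,974] ... ZkClient  - zookeeper state changed (Expired)
STATE_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<ms>\d{3})\]"
    r".*zookeeper state changed \((?P<state>\w+)\)"
)


def zk_state_changes(lines):
    """Return (timestamp, state) pairs for every ZkClient state change."""
    changes = []
    for line in lines:
        match = STATE_RE.search(line)
        if match is None:
            continue
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        ts = ts.replace(microsecond=int(match.group("ms")) * 1000)
        changes.append((ts, match.group("state")))
    return changes


# Sample line taken from the excerpt in this report, unwrapped.
sample = [
    "[2014-08-06 01:44:36,974] 101327050 [main-EventThread] INFO  "
    "org.I0Itec.zkclient.ZkClient  - zookeeper state changed (Expired)",
    "some unrelated log line",
]

for when, state in zk_state_changes(sample):
    print(when.isoformat(), state)
```

On the real file this would be called as
zk_state_changes(open("/var/log/kafka/kafka.log")); the gaps between
consecutive entries then show how long each state lasted.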

