[Bug 67817] New: Monitor for anomalies/spikes in read failures of memcached

bugzilla-daemon Thu, 10 Jul 2014 11:29:11 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=67817


            Bug ID: 67817
           Summary: Monitor for anomalies/spikes in read failures of
                    memcached
           Product: Wikimedia
           Version: wmf-deployment
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: Unprioritized
         Component: General/Unknown
          Assignee: [email protected]
          Reporter: [email protected]
                CC: [email protected], [email protected]
       Web browser: ---
   Mobile Platform: ---

(Came out of
https://wikitech.wikimedia.org/wiki/Incident_documentation/20140517-bits )

Discussion:

Timo: In retrospect we saw that we had data in logstash clearly indicating a
massive increase in read failures from this memcached instance (basically from
< 1% to nearly 100%). This could and should be monitored by icinga and reported
to ops automatically. This would've helped us catch it much earlier.
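
A minimal sketch of the kind of check described above, assuming the failure
and read counts for a time window are already available from some source.
The function name and the warn/crit thresholds are hypothetical and not
tuned against real data; only the exit codes follow the usual Nagios/Icinga
plugin convention.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_read_failure_rate(failures, reads, warn=0.05, crit=0.25):
    """Return (exit_code, status_line) for one window of memcached reads.

    failures and reads are counts over the same window. warn and crit are
    hypothetical thresholds expressed as fractions of all reads.
    """
    if reads <= 0:
        # Without any reads the rate is undefined, so report UNKNOWN.
        return UNKNOWN, "UNKNOWN: no memcached reads observed in window"
    rate = failures / reads
    line = "read failure rate {:.1%} ({}/{})".format(rate, failures, reads)
    if rate >= crit:
        return CRITICAL, "CRITICAL: " + line
    if rate >= warn:
        return WARNING, "WARNING: " + line
    return OK, "OK: " + line
```

With these assumed thresholds, a window at the normal "< 1%" level stays OK,
while a window near 100% failures is reported as CRITICAL.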

Chat from irc on 2014-07-09:
[10:16]  <Krinkle>  For something that is logged in logstash (e.g. memcached
errors), what is the strategy you'd typically take to monitor it in icinga? Is
there a step in between or would you actually have icinga use logstash?
[10:16]  <Krinkle>  I think the latter should be possible for more complex
queries or aggregated data. Though I reckon in case of memcached there's
probably a more direct approach possible.
[10:18]  <bd808>  Good question. Logstash by itself can do point-in-time
monitoring, but it really has no useful way to alert on trends itself. 
[10:19]  <Krinkle>  I think most critical things should probably be polled by
icinga directly. But on multiple occasions I have used logstash to quite
easily pinpoint where an error came from. And it'd be useful to have those
trends also result in pings to ops (perhaps not as critical via text but at
least an irc ping would be useful).
[10:20]  <bd808>  One way™ to do it would be to graph trends in graphite driven
by counts made by logstash and alert with icinga when the trend does something.
[10:20]  <Krinkle>  Right now logstash is mostly polling and digging manually,
after the fact. That's immensely useful and it's good at that. But I think it
has more potential.
[10:20]  <Krinkle>  Ah, I see. So it'd go to graphite after logstash.
Interesting.
[10:20]  <bd808>   We aren't doing it now, but logstash can feed graphite in a
statsd fashion
[10:20]  <Krinkle>  Right. 
[10:21]  <Krinkle>  For some reason I thought they might also be able to feed
graphite from the source that feeds logstash.
[10:21]  <Krinkle>  I guess that's still possible, unless the source is
distributed (or if the query is more advanced). In which case using logstash
in between makes sense
[10:21]  <Krinkle>  (or if the query is more advanced)
[10:22] bd808  nods
[10:22]  <Krinkle>  cool
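
The pipeline bd808 outlines (counts emitted in a statsd fashion, graphed in
graphite, alerted on when the trend changes) could be sketched roughly as
below. This is only an illustration under stated assumptions: the metric
name, the spike factor and the floor are hypothetical, and the spike test is
a simple "latest point versus mean of earlier points" comparison, not a
method anyone in the discussion proposed. The datagram layout is the plain
statsd counter format, `name:value|c`.

```python
import socket


def statsd_counter(name, count=1):
    """Format a statsd counter datagram, e.g. b"some.metric:1|c"."""
    return "{}:{}|c".format(name, int(count)).encode("ascii")


def send_counter(name, count=1, host="127.0.0.1", port=8125):
    """Send one counter increment over UDP (8125 is the usual statsd port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_counter(name, count), (host, port))
    finally:
        sock.close()


def is_spike(counts, factor=5.0, floor=1.0):
    """Return True if the newest count exceeds factor times the baseline.

    counts is a list of per-interval failure counts, oldest first. The
    baseline is the mean of all points except the newest; floor keeps a
    near-zero baseline from triggering on tiny absolute changes. Both
    factor and floor are hypothetical values.
    """
    if len(counts) < 2:
        return False
    history = counts[:-1]
    baseline = sum(history) / len(history)
    return counts[-1] > max(baseline, floor) * factor
```

A log processor would call something like
send_counter("logstash.memcached.read_failure") per matching event (the
metric name is made up for this example), and a check would fetch the
resulting series and pass it to is_spike.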

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
