akosiaris added a project: serviceops-radar.
akosiaris added a comment.
Sorry for not answering earlier.

In T255410#6494077 <https://phabricator.wikimedia.org/T255410#6494077>, @Pablo-WMDE wrote:

> I unfortunately don't know how to do this for single documents. The links show the warning for me as well but reproduce fine.

OK, then that should suffice. Thanks for confirming that. I can confirm there is a minor increase in errors during the time period of that document. That being said, I've been looking into this a bit more. A couple of notes:

- Grafana for the last 7 days (https://grafana.wikimedia.org/d/JcMStTFGz/termbox-slo-panel?orgId=1&from=1601337600000&to=1602028799000) reports 517 500s.
- Logstash (https://logstash.wikimedia.org/goto/2d69a0f714d7fb66c1959da2f0e8b69a) says 533 errors. Note that Logstash also includes eqiad, which wasn't pooled during those 7 days but still receives health check requests and has to reach over to codfw with added latency. I am going to treat the two numbers as roughly equal, since the 17-entry discrepancy doesn't matter for the rest of this comment.
- We are starting to work on some preliminary/draft SLOs for mediawiki. There is still some work to be done on getting the numbers, but once we have them, it would be prudent to align the termbox SLO with them: termbox depends on mediawiki, so it doesn't make sense for termbox to promise a stricter SLO than mediawiki does.

So, we have an error rate of 0.01889% (or 0.0001889), with the SLO of the service being 0.1% (or 0.001) per T212189#5007579 <https://phabricator.wikimedia.org/T212189#5007579>. The flip side of that is an availability of 99.98111%, which is something to be rather proud of (see https://en.wikipedia.org/wiki/High_availability#Percentage_calculation; there is a short sketch of the arithmetic after the notes below). If we increase the timespan to 30 days (https://grafana.wikimedia.org/d/JcMStTFGz/termbox-slo-panel?orgId=1&from=1599350400000&to=1602028799000) we get 0.07301%, roughly 4 times higher, but still below the SLO. Note that if we extend the window to before August 26th, the picture changes heavily, e.g. https://grafana.wikimedia.org/d/JcMStTFGz/termbox-slo-panel?orgId=1&from=1597449600000&to=1602028799000. However, as https://grafana.wikimedia.org/d/wJRbI7FGk/termbox?viewPanel=15&orgId=1&from=1598227200000&to=1598659199000 shows, a deployment (https://sal.toolforge.org/production?p=0&q=&d=2020-08-26 points out e03ee593f57adc7556f7a4 <https://phabricator.wikimedia.org/rDEPLOYCHARTSe03ee593f57adc7556f7a4af063caabea33c395c> - enabling the service proxy, in fact) already fixed that, so corrective action has been taken since this task was created.

Let's stick to the 7-day timespan for now. Notes again:

- Total of 533 errors in Logstash.
- 273 are `timeout of 3000ms exceeded` - tracked in T255450 <https://phabricator.wikimedia.org/T255450>. This seems to me the most interesting one to visit (see the second sketch after this list for where that message originates).
- 170 are `Request failed with status code 500`. All of those are constrained to the timespan 2020-09-30T20:28:57 to 2020-09-30T22:01:05, and it's mediawiki that is returning those errors: https://logstash.wikimedia.org/goto/02a4bbcab3b7864b4b9a91fd7a26fb4a.
- 77 are `Request failed with status code 503`. Those are from the sidecar envoy instance that termbox uses to connect to mediawiki. The reasons for adopting envoy are explained in https://wikitech.wikimedia.org/wiki/Envoy#Envoy_at_WMF (note that the same component also offers TLS termination, so that termbox doesn't need to know or care about our internal TLS configuration). I guess this is also tracked in T263764 <https://phabricator.wikimedia.org/T263764>, so I'll add a bit more information there.
- The remaining 13 events don't seem worth looking into further.
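As an aside, here is a minimal sketch of the availability arithmetic above. The total request count was not reported directly in this comment; it is back-derived from the 517 errors and the 0.01889% rate, so treat it as an approximation (the Grafana SLO panel is the authoritative source).

```
// Minimal sketch of the error-rate/availability arithmetic above.
// totalRequests is an assumption back-derived from 517 errors at 0.01889%.
const errors = 517;              // 500s reported by Grafana over 7d
const totalRequests = 2_737_000; // approximate, implied by the 0.01889% rate
const sloErrorBudget = 0.001;    // 0.1% per T212189#5007579

const errorRate = errors / totalRequests; // ~0.0001889
const availability = 1 - errorRate;       // ~0.9998111

console.log(`error rate:   ${(errorRate * 100).toFixed(5)}%`);   // 0.01889%
console.log(`availability: ${(availability * 100).toFixed(5)}%`); // 99.98111%
console.log(`within SLO:   ${errorRate <= sloErrorBudget}`);      // true
```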
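And for context on the error classes in the list above, a hypothetical sketch of how a request from a service like termbox to mediawiki through the local envoy sidecar might look. The listener port and API path are illustrative assumptions, not termbox's actual configuration; only the 3000ms timeout and the error message shapes come from the logs above.

```
import axios from 'axios';

// Hypothetical sidecar address; the real listener port comes from the
// service's deployment chart and is not shown in this comment.
const mwApi = axios.create({
  baseURL: 'http://localhost:6500',
  timeout: 3000, // axios throws "timeout of 3000ms exceeded" when this fires (T255450)
});

// A "Request failed with status code 500" is mediawiki's own error passed
// through envoy; a 503 is typically envoy itself failing to reach upstream.
export async function queryMediawiki(params: Record<string, string>): Promise<unknown> {
  const response = await mwApi.get('/w/api.php', { params });
  return response.data;
}
```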
Overall, I am inclined to say that, while the SLO isn't being violated over the course of the quarter, this should be a low priority.

In T255410#6494416 <https://phabricator.wikimedia.org/T255410#6494416>, @toan wrote:

> @akosiaris I did some tinkering in the kibana ui and came up with this (hopefully) shareable link <https://logstash.wikimedia.org/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-12h,mode:quick,to:now))&_a=(columns:!(_source),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:host,negate:!t,type:phrase,value:gerrit1001),query:(match:(host:(query:gerrit1001,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:host,negate:!t,type:phrase,value:grafana1002),query:(match:(host:(query:grafana1002,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:host,negate:!t,type:phrase,value:gerrit2001),query:(match:(host:(query:gerrit2001,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:meta.stream,negate:!t,type:phrase,value:w3c.reportingapi.network_error),query:(match:(meta.stream:(query:w3c.reportingapi.network_error,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:host,negate:!t,type:phrase,value:phab1001),query:(match:(host:(query:phab1001,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:host,negate:!t,type:phrase,value:cp3054.esams.wmnet),query:(match:(host:(query:cp3054.esams.wmnet,type:phrase))))),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:termbox)),sort:!('@timestamp',desc))>
>
> The 503's seem to occur seemingly at random but in this interesting bursts pattern.

Thanks. I am having some trouble drilling down into the events using that link, as it contains a variety of events (e.g. all errors from the termbox service, multiple different errors from the wikibase mw extension). I'd like to believe that the termbox dashboard I linked to above is a bit more helpful.

TASK DETAIL
https://phabricator.wikimedia.org/T255410