akosiaris added a project: serviceops-radar.
akosiaris added a comment.

  Sorry for not answering earlier.
  
  In T255410#6494077 <https://phabricator.wikimedia.org/T255410#6494077>, 
@Pablo-WMDE wrote:
  
  > I unfortunately don't know how to do this for single documents. The links 
show the warning for me as well but reproduce fine.
  
  Ok, then that should suffice. Thanks for confirming that. I can confirm there is a minor increase in errors during the time period of that document.
  
  That being said, I've been looking a bit more into this. A couple of notes:
  
  - Grafana for the last 7 days 
(https://grafana.wikimedia.org/d/JcMStTFGz/termbox-slo-panel?orgId=1&from=1601337600000&to=1602028799000)
 reports 517 500s.
  - Logstash (https://logstash.wikimedia.org/goto/2d69a0f714d7fb66c1959da2f0e8b69a) says 533 errors. Note that logstash also includes eqiad, which wasn't pooled during those 7 days but still receives health check requests and has to reach over to codfw with added latency. So I am going to treat the two numbers as roughly equal, because that discrepancy of 16 entries doesn't matter for the rest of this comment.
  - We are starting to work on some preliminary/draft SLOs for mediawiki. There is still some work to be done on getting the numbers, but once we have them it would be prudent to align the SLO of termbox with them: termbox depends on mediawiki, so it doesn't make sense for it to promise a stricter SLO than mediawiki does (see the sketch right after this list).
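  
  As a back-of-the-envelope illustration of that last point (my own sketch with hypothetical numbers, not actual SLO figures): a service that calls a dependency on every request can at best be as available as that dependency, because availabilities compose multiplicatively.
  
  ```
  // Illustration only: hypothetical availabilities, not real SLO figures.
  const mediawikiAvailability = 0.999;   // e.g. a 99.9% mediawiki SLO
  const termboxOwnAvailability = 0.9995; // termbox's own (hypothetical) availability
  
  // Best case termbox can credibly offer when every request goes through mediawiki:
  const combined = mediawikiAvailability * termboxOwnAvailability;
  console.log(combined); // ~0.9985, i.e. necessarily worse than mediawiki alone
  ```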
  
  So, we have an error rate of 0.01889% (or 0.0001889), with the SLO of the service being 0.1% (or 0.001) per T212189#5007579 <https://phabricator.wikimedia.org/T212189#5007579>. The flip side of that is an availability of 99.98111%, which is something to be rather proud of (see https://en.wikipedia.org/wiki/High_availability#Percentage_calculation).
  
  If we increase the timespan to 30d (https://grafana.wikimedia.org/d/JcMStTFGz/termbox-slo-panel?orgId=1&from=1599350400000&to=1602028799000) we get 0.07301%, roughly 4 times higher, but still below the SLO.
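  
  For reference, the arithmetic behind those percentages (a minimal sketch; `errors` and `totalRequests` are placeholders, the request totals are not numbers taken from the dashboard):
  
  ```
  // error rate = failed requests / total requests; availability is its complement
  function errorRate(errors: number, totalRequests: number): number {
    return errors / totalRequests;
  }
  
  function availability(errors: number, totalRequests: number): number {
    return 1 - errorRate(errors, totalRequests);
  }
  
  // e.g. an error rate of 0.0001889 (0.01889%) means an availability of
  // 1 - 0.0001889 = 0.9998111, i.e. 99.98111%, comfortably within the
  // 0.1% error / 99.9% availability SLO from T212189#5007579.
  ```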
  
  Note that if we extend the timespan to before August 26th, the picture changes heavily, e.g. https://grafana.wikimedia.org/d/JcMStTFGz/termbox-slo-panel?orgId=1&from=1597449600000&to=1602028799000. However, as https://grafana.wikimedia.org/d/wJRbI7FGk/termbox?viewPanel=15&orgId=1&from=1598227200000&to=1598659199000 shows, a deployment (https://sal.toolforge.org/production?p=0&q=&d=2020-08-26 points to e03ee593f57adc7556f7a4 <https://phabricator.wikimedia.org/rDEPLOYCHARTSe03ee593f57adc7556f7a4af063caabea33c395c>, which enabled the service proxy) had already fixed that, so corrective action has been taken since this task was created.
  
  Let's stick to the 7d timespan for now. Notes again:
  
  - Total of 533 errors in logstash
  - 273 are `timeout of 3000ms exceeded` - tracked in T255450 <https://phabricator.wikimedia.org/T255450>. This seems to me the most interesting one to look into.
  - 170 are `Request failed with status code 500` - all of those fall within the timespan 2020-09-30T20:28:57 to 2020-09-30T22:01:05, and it's mediawiki that is returning those errors: https://logstash.wikimedia.org/goto/02a4bbcab3b7864b4b9a91fd7a26fb4a.
  - 77 are `Request failed with status code 503`. Those come from the sidecar envoy instance that termbox uses to connect to mediawiki (a rough sketch of that request path follows this list). The reasons for adopting envoy are explained in https://wikitech.wikimedia.org/wiki/Envoy#Envoy_at_WMF (note that the same component also offers TLS termination, so that termbox doesn't need to know or care about our internal TLS configuration). I guess this is also tracked in T263764 <https://phabricator.wikimedia.org/T263764>, so I'll add a bit more information there.
  - The remaining 13 events don't seem worth looking into further.
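  
  To tie the bullet points above together, here is a minimal sketch of how I understand the termbox → envoy → mediawiki request path and how those error messages come about. The use of axios, the listener port, the path and the exact timeout value are my assumptions for illustration, not copied from the termbox code or its deployed config.
  
  ```
  import axios from 'axios';
  
  // Sketch only: requests to mediawiki go through the local envoy sidecar,
  // which handles TLS and failover, so termbox itself only speaks plain
  // HTTP to localhost. The port, path and timeout below are hypothetical.
  const mwApi = axios.create({
    baseURL: 'http://localhost:6500/w/api.php', // hypothetical envoy listener
    timeout: 3000, // hitting this surfaces as "timeout of 3000ms exceeded"
  });
  
  // A 500 returned by mediawiki itself and a 503 synthesized by envoy both
  // show up as "Request failed with status code NNN" axios errors in logstash.
  ```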
  
  Overall, I am inclined to say that, as long as the SLO isn't being violated over the course of the quarter, this should be a low priority.
  
  In T255410#6494416 <https://phabricator.wikimedia.org/T255410#6494416>, @toan 
wrote:
  
  > @akosiaris I did some tinkering in the kibana ui and came up with this 
(hopefully)  shareable link 
<https://logstash.wikimedia.org/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-12h,mode:quick,to:now))&_a=(columns:!(_source),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:host,negate:!t,type:phrase,value:gerrit1001),query:(match:(host:(query:gerrit1001,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:host,negate:!t,type:phrase,value:grafana1002),query:(match:(host:(query:grafana1002,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:host,negate:!t,type:phrase,value:gerrit2001),query:(match:(host:(query:gerrit2001,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:meta.stream,negate:!t,type:phrase,value:w3c.reportingapi.network_error),query:(match:(meta.stream:(query:w3c.reportingapi.network_error,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:host,negate:!t,type:phrase,value:phab1001),query:(match:(host:(query:phab1001,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:host,negate:!t,type:phrase,value:cp3054.esams.wmnet),query:(match:(host:(query:cp3054.esams.wmnet,type:phrase))))),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:termbox)),sort:!('@timestamp',desc))>
  >
  > The 503s seem to occur at random, but in this interesting burst pattern.
  
  Thanks. I am having some trouble drilling down into the events using that link, as it contains a variety of events (e.g. all errors from the termbox service, multiple different errors from the wikibase mw extension). I'd like to believe that the termbox dashboard I linked to above is a bit more helpful.

TASK DETAIL
  https://phabricator.wikimedia.org/T255410
