toan added a comment.

  First off, thank you @akosiaris for your time investigating this.
  
  At the start of this task we were presented with the title "Termbox SSR 
connection terminated very often" and the logstash screenshot of the events 
that took place starting at 2020-06-10. Since then the patch 
https://gerrit.wikimedia.org/r/c/605554 has been applied and we are now seeing 
a much lower error rate. We are not aware of any other fix that could account 
for this reduction.
  
  https://phabricator.wikimedia.org/T255450 "timeout of 3000ms exceeded" seems 
to happen in bursts: these errors rarely occur alone, but are instead spread 
across multiple Kubernetes containers within a couple of seconds or less.
  
  From 2020-07-12 until 2020-08-31 these errors were much more frequent, with 
error rates averaging around 5000 per week on eqiad 
(https://logstash-next.wikimedia.org/goto/4d74e7ec568b38ecebfe01328a2bad2d). 
Since then, up to the time of writing, the error rates are again much lower, 
ranging from 100 to 300 per week for both eqiad and codfw 
(https://logstash-next.wikimedia.org/goto/86501120a879772a1db250abb1c250dc).
  
  https://phabricator.wikimedia.org/T263764 The second recurring error is 
"Request failed with status code 503" ("upstream connect error or 
disconnect/reset before headers. reset reason: local reset"), which does not 
follow the same pattern as the timeout above. These errors happen seemingly at 
random, and most of the time only a single error is reported rather than a 
series.
  
  For the last month these seem to have occurred roughly 100 times or fewer 
per week, with the exception of the period 2020-09-10 16:00 - 21:00 
(https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=termbox&var-destination=mwapi-async&from=1599753677237&to=1599770627856)
  
  As mentioned in https://phabricator.wikimedia.org/T263764#6524589, errors 
like these are expected to occur, and as we are still below the SLO we will 
not investigate them further for the time being.
  
  After discussing this with @Michael this evening, we've decided that these 
two most prominent errors will not be investigated further for now, using the 
SLO as the guideline for that decision. Instead they will be documented on 
https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service with a section 
describing these errors, their expected frequency and where in the stack they 
occur.

TASK DETAIL
  https://phabricator.wikimedia.org/T255410

To: Michael, toan
Cc: toan, Lucas_Werkmeister_WMDE, Sakretsu, akosiaris, JMeybohm, WMDE-leszek, 
Pablo-WMDE, Tarrow, Jakob_WMDE, Addshore, Aklapper, Michael, wkandek, 
Akuckartz, Iflorez, darthmon_wmde, alaa_wmde, Nandana, jijiki, Lahi, Gq86, 
GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
Jonas, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331, Dzahn