toan added a comment.
First off, thank you @akosiaris for your time investigating this.

At the start of this task we were presented with the title "Termbox SSR connection terminated very often" and the logstash screenshot of the events that started on 2020-06-10. Since then the patch https://gerrit.wikimedia.org/r/c/605554 has been applied and we are now seeing a much lower error rate. We are not aware of any other fix that could account for the drop in these errors.

https://phabricator.wikimedia.org/T255450
"timeout of 3000ms exceeded" seems to happen in bursts: these errors rarely occur alone, but are instead spread across multiple kubernetes containers within a couple of seconds or less. From 2020-07-12 until 2020-08-31 these errors were much more frequent, averaging around 5000 every week on eqiad (https://logstash-next.wikimedia.org/goto/4d74e7ec568b38ecebfe01328a2bad2d). Since then, up to the time of writing, the error rates have again been much lower, ranging from 100 to 300 every week for both eqiad and codfw (https://logstash-next.wikimedia.org/goto/86501120a879772a1db250abb1c250dc).

https://phabricator.wikimedia.org/T263764
The second recurring error is "Request failed with status code 503" ("upstream connect error or disconnect/reset before headers. reset reason: local reset"), which does not follow the same pattern as the timeout. These errors happen seemingly at random, and most of the time only a single error is reported rather than a series.
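To make the difference between the two patterns concrete (bursts across several containers versus isolated single errors), here is a minimal sketch of how exported log events could be grouped. It is only an illustration: the function names, the 2-second gap, and the sample events are assumptions, not taken from the actual logstash data or dashboards.

```python
from datetime import datetime, timedelta


def group_bursts(events, max_gap=timedelta(seconds=2)):
    """Group (timestamp, container) events into bursts.

    Events are sorted by time; an event joins the current burst when it
    is at most max_gap after the previous event, otherwise it starts a
    new burst. The 2-second default is an assumed threshold.
    """
    bursts = []
    for ts, container in sorted(events):
        if bursts and ts - bursts[-1][-1][0] <= max_gap:
            bursts[-1].append((ts, container))
        else:
            bursts.append([(ts, container)])
    return bursts


def summarize(bursts):
    """Report start time, event count and distinct containers per burst."""
    return [
        {
            "start": burst[0][0],
            "events": len(burst),
            "containers": len({container for _, container in burst}),
        }
        for burst in bursts
    ]


# Hypothetical sample: three timeouts within one second on different
# containers, followed much later by a single isolated error.
t0 = datetime(2020, 9, 1, 12, 0, 0)
sample = [
    (t0, "termbox-a"),
    (t0 + timedelta(milliseconds=300), "termbox-b"),
    (t0 + timedelta(milliseconds=900), "termbox-c"),
    (t0 + timedelta(hours=3), "termbox-a"),
]
summary = summarize(group_bursts(sample))
for row in summary:
    print(row["start"].isoformat(), row["events"], row["containers"])
```

With a grouping like this, a burst shows up as one row with several events and several containers, while the 503-style errors would mostly appear as rows with a single event.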
Over the last month the 503 errors have occurred roughly 100 times or fewer every week, with the exception of the period 2020-09-10 16:00 - 21:00 (https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=termbox&var-destination=mwapi-async&from=1599753677237&to=1599770627856). As mentioned here https://phabricator.wikimedia.org/T263764#6524589, errors like these are expected to occur, and as we are still below the SLO we will not investigate them further for the time being.

After discussing this with @Michael this evening, we've decided that these two most prominent errors will not be investigated further for now, using the SLO as the guideline for that decision. Instead they will be documented on https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service with a section describing these errors, their expected frequency and where in the stack they occur.

TASK DETAIL
https://phabricator.wikimedia.org/T255410

To: Michael, toan
Cc: toan, Lucas_Werkmeister_WMDE, Sakretsu, akosiaris, JMeybohm, WMDE-leszek, Pablo-WMDE, Tarrow, Jakob_WMDE, Addshore, Aklapper, Michael, wkandek, Akuckartz, Iflorez, darthmon_wmde, alaa_wmde, Nandana, jijiki, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331, Dzahn
