Michael added a comment.
In T255410#6549439 <https://phabricator.wikimedia.org/T255410#6549439>, @akosiaris wrote:

> In T255410#6543118 <https://phabricator.wikimedia.org/T255410#6543118>, @Michael wrote:
>
>> @akosiaris Thank you a lot for your detailed response. I did look into those errors a tiny bit more to properly document them, as can now be seen on wikitech <https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service#Availability_objectives_and_accepted_operational_errors>.
>>
>> In the course of that I looked at the last days and noticed some discrepancies to the numbers you provided above. All the following data is for the 7 days between 2020-10-07 00:00:00 and 2020-10-13 23:59:59.
>
> I think you just exposed some weird behavior/bug in Prometheus's `increase()` function regarding counter resets. I've added a panel to the graph showcasing it. If you manually subtract the valleys from the peaks for the 3 distinct timeframes depicted there, you get almost the same errors as logstash. It's `62-0 + 99 - 0 + 484 - 440= 170`. It's probably that last (first timewise) timeframe that throws Prometheus off. Given that per the docs [1]
>
> It is syntactic sugar for rate(v) multiplied by the number of seconds under the specified time range window, and should be used primarily for human readability.
>
> there is probably something funny going on over the large timeframe. The rate() is also depicted in the panel and it is gradually dropping as well, but it is considerably higher in the first timeframe.

That seems very strange. I would have expected the //error rate// to be calculated as `(number of errors / number of total requests)` for the given timeframe. How does it actually work? Something like `(number of milliseconds with error / number of total milliseconds in timeframe)`?
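
To make the request-based definition I had in mind concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions and not how Prometheus is implemented: `counter_increase` and `error_ratio` are hypothetical helper names, the sample values are made up rather than taken from the dashboards, and the sketch skips the extrapolation to the window boundaries that `increase()` additionally performs. It treats any drop between two consecutive samples as a counter reset, which is the "peaks minus valleys" calculation described above.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Sum the growth of a counter across resets (hypothetical helper).

    A drop between consecutive samples is treated as a restart from zero,
    so the value after the drop counts in full. No extrapolation is done.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev  # normal growth within one counter lifetime
        else:
            total += cur  # counter reset: everything since zero is new
    return total


def error_ratio(error_samples: Sequence[float],
                request_samples: Sequence[float]) -> float:
    """Errors divided by total requests over the same window."""
    requests = counter_increase(request_samples)
    if requests == 0:
        return 0.0  # no traffic in the window, so no meaningful ratio
    return counter_increase(error_samples) / requests


# Made-up samples with two resets; not real Termbox data.
errors = [10, 14, 0, 5, 9, 0, 3]
requests = [1000, 1400, 0, 500, 900, 0, 300]

print(counter_increase(errors))
print(error_ratio(errors, requests))
```

With these made-up samples the three counter lifetimes contribute `14 - 10`, `9 - 0` and `3 - 0` errors, so the summed increase matches subtracting each valley from its peak.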

>> I was surprised by that, but noticed that there was also a similar number of network errors between MediaWiki and the Termbox SSR app in that timeframe:
>>
>> - the MediaWiki (PHP) logstash <https://logstash.wikimedia.org/goto/995becc306bb3da55de9e321631c40d0> has **104** errors of Termbox being unreachable
>
> That's actually from the PoV of MediaWiki. If you put this logstash dashboard and the Termbox one side by side, there's considerable overlap, as events are depicted in both.

Oh, you are right! If I look only at the timeouts and remove the timeouts to the unused datacenter (triggered by the health checks), then they even line up almost perfectly!

>> Is the understanding laid out above correct?
>
> I think it's wrong to sum the 2 logstash dashboards (in fact, it's just coincidence that the numbers added up to something close to 277, as that was a made-up number from Prometheus). They are of a different nature, and thus it is wrong to sum them, as you will be double counting events.

Thank you for providing this feedback! I've updated our documentation <https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service#Availability_objectives_and_accepted_operational_errors>.

TASK DETAIL
  https://phabricator.wikimedia.org/T255410

To: Michael
Cc: toan, Lucas_Werkmeister_WMDE, Sakretsu, akosiaris, JMeybohm, WMDE-leszek, Pablo-WMDE, Tarrow, Jakob_WMDE, Addshore, Aklapper, Michael, wkandek, Akuckartz, Iflorez, darthmon_wmde, alaa_wmde, Nandana, jijiki, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331, Dzahn