akosiaris added a comment.
In T255410#6550492 <https://phabricator.wikimedia.org/T255410#6550492>, @Michael wrote:

> That seems very strange. I would have expected the //error rate// to be calculated by `(number of errors / number of total requests)` for the given timeframe. How does it actually work? Something like `(number of milliseconds with error/number of total milliseconds in timeframe)`?

You can say that again :-). The main formula is what you described. In Prometheus terms, it's

  sum(increase(service_runner_request_duration_seconds_count{service="termbox", prometheus="k8s", uri="termbox", status="500"}[$__range]))
  /
  sum(increase(service_runner_request_duration_seconds_count{service="termbox", prometheus="k8s", uri="termbox", status=~"200|500"}[$__range]))

and that's what the left panel in that dashboard has.

The issue isn't with the division, but rather with the increase() function. The right-hand panel is just the numerator of the above equation, so it's

  sum(
    increase(
      service_runner_request_duration_seconds_count{service="termbox", prometheus="k8s", uri="termbox", status="500"}[$__range]
    )
  )

The `sum()` adds up the values across all the instances of termbox in that timeframe; the `increase()` calculates the change in that quantity from the start to the end of the timeframe. Normally it works, but in this case it has failed.

My guess as to what happened is that, due to 2 deployments (you can use the main termbox dashboard to spot them), termbox pods were destroyed and new ones started. So the metrics changed, and the internal counter-reset detection of rate() (which increase() is built on) could not function. If you target a week without deploys, you won't see that.

If you want to know more about Prometheus counters, there's more info on how they work at https://www.robustperception.io/how-does-a-prometheus-counter-work

It also means we'll have to figure out a better way to calculate the SLO across large timeframes.
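To make the guess above a bit more concrete, here is a rough Python sketch of the effect. It is not how Prometheus is implemented: `simple_increase` and `summed_increase` are hypothetical helpers, the pod names and sample values are made up, and the sketch deliberately ignores the extrapolation that the real increase() applies at the edges of the range. It only shows that reset detection works within one series, while a replaced pod shows up as a brand new series whose first sample is treated as the baseline.

```python
from typing import Dict, List


def simple_increase(samples: List[float]) -> float:
    """Simplified stand-in for increase() on ONE series (no extrapolation).

    Adds up the deltas between consecutive samples. A drop in value is
    treated as a counter reset, so the new value is counted in full.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            total += cur          # reset: counter restarted from 0 and reached `cur`
        else:
            total += cur - prev   # normal monotonic growth
    return total


def summed_increase(series: Dict[str, List[float]]) -> float:
    """Stand-in for sum(increase(...)): per-series increase, then summed."""
    return sum(simple_increase(samples) for samples in series.values())


# Case 1: the counter resets in place, so it stays one series.
# (Made-up numbers: 5 errors, restart, then 4 more errors.)
single = {"pod-a": [0, 2, 5, 3, 4]}

# Case 2: the same events, but the restart created a new pod, hence a new
# series. The first sample of pod-b (3) is only a baseline, so those 3
# errors are not counted.
split = {"pod-a": [0, 2, 5], "pod-b": [3, 4]}

print(summed_increase(single))  # 9.0
print(summed_increase(split))   # 6.0
```

In the second case the errors that happened before the first scrape of the new pod drop out of the numerator, which is one plausible way a long range spanning deployments can end up with a skewed error count.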
TASK DETAIL
  https://phabricator.wikimedia.org/T255410