Michael added a comment.

  In T255410#6549439 <https://phabricator.wikimedia.org/T255410#6549439>, 
@akosiaris wrote:
  
  > In T255410#6543118 <https://phabricator.wikimedia.org/T255410#6543118>, 
@Michael wrote:
  >
  >> @akosiaris Thank you very much for your detailed response. I looked into 
those errors a bit more to document them properly, as can now be seen on 
wikitech 
<https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service#Availability_objectives_and_accepted_operational_errors>.
  >>
  >> In the course of that I looked at the last few days and noticed some 
discrepancies with the numbers you provided above. All the following data is for 
the 7 days between 2020-10-07 00:00:00 and 2020-10-13 23:59:59.
  >
  > I think you just exposed some weird behavior/bug in prometheus's 
`increase()` function regarding counter resets. I've added a panel to the 
graph showcasing it. If you manually subtract the valleys from the peaks for 
the 3 distinct timeframes depicted there, you get almost the same error count as 
logstash. It's `62 - 0 + 99 - 0 + 484 - 440 = 170`. It's probably that last (first 
timewise) timeframe that throws prometheus off. Given that per the docs [1]
  >
  >   It is syntactic sugar for rate(v) multiplied by the number of seconds 
under the specified time range window, and should be used primarily for human 
readability.
  >
  > there is probably something funny going on over the large timeframe. The 
`rate()` is also depicted in the panel and it is gradually dropping as well, but 
it's quite a bit higher in the first timeframe.
  
  That seems very strange. I would have expected the //error rate// to be 
calculated by `(number of errors / number of total requests)` for the given 
timeframe. How does it actually work? Something like `(number of milliseconds 
with error/number of total milliseconds in timeframe)`?
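
  For reference, here is a minimal sketch of the two calculations being 
contrasted: a reset-aware counter increase (the manual "peaks minus valleys" 
sum) and the plain ratio I had expected. This is not prometheus's actual 
implementation, which as far as I understand also extrapolates to the window 
boundaries; the function names and the sample values are made up for 
illustration.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Sum the growth of a counter across consecutive samples.

    Any drop between two samples is treated as a counter reset to zero,
    so the value after the drop counts entirely as new growth.
    No extrapolation to the window boundaries is done (an assumption
    that keeps this sketch simple).
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # Counter reset: the counter restarted from zero.
            total += curr
    return total


def error_ratio(errors: float, requests: float) -> float:
    """Errors divided by total requests; 0.0 when there were no requests."""
    return errors / requests if requests else 0.0


# Hypothetical samples: the counter grows 10 -> 25, resets, then grows to 3 and 8.
samples = [10, 25, 3, 8]
increase = counter_increase(samples)
print(increase)                     # 15 + 3 + 5
print(error_ratio(increase, 1000))  # with a made-up total of 1000 requests
```

  With these made-up samples the counter grows by 15, resets, and then grows 
by 3 and by 5, so the reset-aware increase is 23.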
  
  >> I was surprised by that, but noticed that there was also a similar number 
of network errors between MediaWiki and the Termbox SSR app in that timeframe:
  >>
  >> - the MediaWiki (PHP) logstash 
<https://logstash.wikimedia.org/goto/995becc306bb3da55de9e321631c40d0> has 
**104** errors of Termbox being unreachable
  >
  > That's actually from the PoV of mediawiki. If you put this logstash 
dashboard and the termbox one side-by-side there's considerable overlap as 
events are depicted in both.
  
  Oh, you are right! If I look only at the timeouts and remove the timeouts to 
the unused datacenter (triggered by the health checks), then they even line up 
almost perfectly!
  
  >> Is the understanding laid out above correct?
  >
  > I think it's wrong to sum the 2 logstash dashboards (in fact, it's just a 
coincidence that the numbers added up to something close to 277, as that was a 
made-up number from prometheus). They are of a different nature, so summing 
them is wrong, as you would be double counting events.
  
  Thank you for providing this feedback! I've updated our documentation 
<https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service#Availability_objectives_and_accepted_operational_errors>.

TASK DETAIL
  https://phabricator.wikimedia.org/T255410

To: Michael
Cc: toan, Lucas_Werkmeister_WMDE, Sakretsu, akosiaris, JMeybohm, WMDE-leszek, 
Pablo-WMDE, Tarrow, Jakob_WMDE, Addshore, Aklapper, Michael, wkandek, 
Akuckartz, Iflorez, darthmon_wmde, alaa_wmde, Nandana, jijiki, Lahi, Gq86, 
GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
Jonas, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331, Dzahn