EBernhardson added a comment.

  Usually the first stop for this kind of error would be reviewing the `ATS 
Backends <-> Origin Servers Overview` dashboard, which suggests a low rate of 
5xxs: typically 1-5% of requests fail. In a quick review of the last few 500 
requests on one of the servers, they were all malformed queries. We may need to 
look into more specific timespans rather than the generic 500 errors. After 
modifying one of the dashboard queries[1] to return the success rate per 15 
minutes and running it against Thanos to cover all DCs, I looked for periods of 
low success. The following time periods should be reviewed:
  
  2022-04-16T17:30-18:10
  2022-04-17T08:30-10:00
  2022-04-22T16:26-17:12
  2022-04-22T19:09-19:36
  2022-04-26T16:20-17:37
  2022-05-04T19:50-21:42
  
  If this turns up the problem, we could consider how to turn it into an alert.
  
  [1]
  
    sum(increase(trafficserver_backend_requests_seconds_count{status=~"2[0-9][0-9]", cluster=~"cache_text", backend=~"wcqs\\.discovery\\.wmnet"}[15m])) by (backend)
    /
    sum(increase(trafficserver_backend_requests_seconds_count{status=~"[25][0-9][0-9]", cluster=~"cache_text", backend=~"wcqs\\.discovery\\.wmnet"}[15m])) by (backend)
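
  For anyone repeating the search for low-success periods, here is a rough sketch of how the per-15-minute ratios from [1] could be grouped into time windows. It is not the script used for the list above. The function name `low_success_windows`, the 0.95 threshold (picked from the 1-5% failure rate mentioned earlier), and the 900-second step are all assumptions for illustration. The input is assumed to be shaped like the `values` array of a Prometheus range-query response: pairs of unix timestamp and ratio string.

```python
def low_success_windows(samples, threshold=0.95, step=900):
    """Group consecutive low-success samples into (start, end, min_ratio) windows.

    samples:   iterable of (unix_timestamp, ratio_string) pairs, assumed to be
               sorted by timestamp and spaced `step` seconds apart.
    threshold: hypothetical cut-off; ratios below it count as "low success".
    step:      assumed evaluation step in seconds (15 minutes).
    """
    windows = []
    current = None  # [start_ts, end_ts, min_ratio] of the window being built
    for ts, raw in samples:
        ratio = float(raw)
        # NaN (no requests in the interval) compares False here, so it is
        # treated as "not low" and closes any open window.
        if ratio < threshold:
            if current is not None and ts - current[1] <= step:
                # Adjacent low sample: extend the open window.
                current[1] = ts
                current[2] = min(current[2], ratio)
            else:
                if current is not None:
                    windows.append(tuple(current))
                current = [ts, ts, ratio]
        elif current is not None:
            windows.append(tuple(current))
            current = None
    if current is not None:
        windows.append(tuple(current))
    return windows


# Made-up sample data, only to show the call shape.
example = [(0, "0.99"), (900, "0.80"), (1800, "0.70"), (2700, "0.99")]
print(low_success_windows(example))
```

  Grouping adjacent samples keeps a long outage as a single window rather than one entry per 15-minute bucket, which makes the result easier to compare against the list of periods above.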

TASK DETAIL
  https://phabricator.wikimedia.org/T306899

To: EBernhardson
Cc: EBernhardson, FRomeo_WMF, GFontenelle_WMF, Gehel, Fuzheado, Aklapper, 
Dominicbm, Astuthiodit_1, karapayneWMDE, Invadibot, MPhamWMF, maantietaja, 
CBogen, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331