[Wikidata-bugs] [Maniphest] T263764: Termbox service: unusual errors that could be from envoy

2020-10-07 Thread akosiaris
akosiaris edited projects, added serviceops-radar; removed serviceops.
akosiaris added a comment.


  Envoy is being documented at 
https://wikitech.wikimedia.org/wiki/Envoy#Envoy_at_WMF. It is being used by 
termbox to talk to mediawiki (it's a component of a service mesh). The idea is 
to have low-cost persistent TLS connections, with retries and telemetry. For 
more insights aside from the doc link above, the following grafana dashboard is 
useful:
https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=termbox&var-destination=mwapi-async&from=now-7d&to=now
  
  It is expected and absolutely normal that occasionally connections will be 
terminated and reestablished by envoy as the network is not infallible. Some 
will be "masked" by envoy's retry logic, at the cost of extra latency of course.
  
  Using the dashboard above can help track down some of the errors. Logs 
from envoy for termbox are also in logstash; just remove the severity filter 
and they'll appear.
  
  Parsing them can be done using 
https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage
  
  A couple of notes though.
  
  - Unfortunately, those log entries aren't parsed into a JSON object
  - envoy uses HTTP/2 terminology for some things internally, even if HTTP/1.1 is 
used. E.g. you will see `%REQ(:AUTHORITY)%`. That is the HTTP/2 `:authority` 
pseudo-header (https://tools.ietf.org/html/rfc7540#section-8.1.2), which is 
equivalent to the Host HTTP/1.1 header
  - The response flags are usually telling. e.g. `UF: Upstream connection 
failure in addition to 503 response code.` or `URX: The request was rejected 
because the upstream retry limit (HTTP) or maximum connect attempts (TCP) was 
reached.` and so on
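
  The notes above can be sketched as a small parser. This is only a rough illustration, assuming the log lines follow envoy's documented default access log format; the format actually configured for termbox may differ, in which case the regex needs adjusting. The names `ACCESS_LOG_RE`, `RESPONSE_FLAGS`, `parse_line` and the sample line are made up for this example, and only the two flags quoted above are listed.

```python
import re

# Assumption: envoy's documented default access log format. A custom
# format in the real deployment would need a different pattern.
ACCESS_LOG_RE = re.compile(
    r'\[(?P<start_time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<response_code>\d+) (?P<response_flags>\S+) '
    r'(?P<bytes_received>\d+) (?P<bytes_sent>\d+) '
    r'(?P<duration_ms>\d+) (?P<upstream_service_time>\S+) '
    r'"(?P<x_forwarded_for>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'"(?P<request_id>[^"]*)" "(?P<authority>[^"]*)" '
    r'"(?P<upstream_host>[^"]*)"'
)

# Only the flags quoted in the comment above; the envoy docs list the rest.
RESPONSE_FLAGS = {
    "UF": "Upstream connection failure in addition to 503 response code.",
    "URX": "The request was rejected because the upstream retry limit "
           "(HTTP) or maximum connect attempts (TCP) was reached.",
}


def parse_line(line):
    """Return a dict of fields for one access log line, or None if it
    does not match the assumed format."""
    match = ACCESS_LOG_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    flags = entry["response_flags"]
    # "-" means no flags were set; multiple flags are comma-separated.
    entry["response_flags"] = [] if flags == "-" else flags.split(",")
    entry["flag_meanings"] = [
        RESPONSE_FLAGS.get(flag, "unknown flag, see the envoy docs")
        for flag in entry["response_flags"]
    ]
    return entry


# Invented sample line, not taken from real termbox logs.
SAMPLE = (
    '[2020-10-07T10:00:00.000Z] "GET /w/index.php HTTP/1.1" 503 UF 0 91 3 - '
    '"-" "example-agent" "00000000-0000-0000-0000-000000000000" '
    '"example.invalid" "10.0.0.1:443"'
)

if __name__ == "__main__":
    parsed = parse_line(SAMPLE)
    print(parsed["response_code"], parsed["response_flags"], parsed["authority"])
```

  The `authority` field here is what would be the Host header in HTTP/1.1 terms.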
  
  Hopefully the above helps shed a bit of light.
  
  Finally, as far as the `Should we be taking any action about these?` question 
goes, my answer would be to use the service's SLO as a guide. As pointed out in 
T255410, it doesn't seem worth investigating those further right now.
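
  Using the SLO as a guide could look roughly like the check below. It is a minimal sketch: the function name `within_slo`, the request counts and the 99.9% target are hypothetical and are not termbox's real numbers.

```python
def within_slo(total_requests, failed_requests, slo_target):
    """Return True if the observed failures fit inside the error budget
    implied by slo_target (e.g. 0.999 for 99.9% success)."""
    if total_requests == 0:
        # No traffic means no budget was spent.
        return True
    allowed_failures = (1 - slo_target) * total_requests
    return failed_requests <= allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    print(within_slo(100_000, 50, 0.999))
    print(within_slo(100_000, 200, 0.999))
```

  If the check passes for the period in question, the errors are within what the SLO already tolerates.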

TASK DETAIL
  https://phabricator.wikimedia.org/T263764

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: akosiaris
Cc: toan, Michael, JMeybohm, Tarrow, akosiaris, Aklapper, wkandek, Akuckartz, 
darthmon_wmde, Nandana, jijiki, Lahi, Gq86, GoranSMilovanovic, QZanden, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, 
Lydia_Pintscher, Mbch331, Dzahn
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] T263764: Termbox service: unusual errors that could be from envoy

2020-09-24 Thread Tarrow
Tarrow removed a subscriber: serviceops.
Tarrow added a project: serviceops.

To: Tarrow
Cc: JMeybohm, Tarrow, akosiaris, Aklapper, wkandek, Akuckartz, darthmon_wmde, 
Nandana, jijiki, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331, Dzahn, 
#serviceops


[Wikidata-bugs] [Maniphest] T263764: Termbox service: unusual errors that could be from envoy

2020-09-24 Thread Tarrow
Tarrow created this task.
Tarrow added projects: Wikidata-Termbox, Wikidata.
Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION
  We're seeing errors coming from our Termbox Service that look like this.
  We're trying to make sure that we have some understanding of the different 
types of timeout so that we can minimize them where possible. We're also seeing 
this flurry of related errors that we think is coming from some envoy thing.
  
  We didn't find a corresponding error from an App server that we were 
expecting to see (e.g. an error from Special:EntityData). Thus, we think our 
connection is perhaps having problems with envoy.
  
  So we wonder if those errors might actually come from the TLS envoy service 
that we don't really understand. How would we track down where those errors are 
coming from? Should we be taking any action about these?
  
  Maybe @akosiaris or @JMeybohm might be able to give us a hint? We guess it 
might be related to T254581.

To: Tarrow
Cc: Tarrow, akosiaris, Aklapper, Akuckartz, darthmon_wmde, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
Wikidata-bugs, aude, Lydia_Pintscher, Mbch331


[Wikidata-bugs] [Maniphest] T263764: Termbox service: unusual errors that could be from envoy

2020-09-24 Thread Tarrow
Tarrow added subscribers: JMeybohm, serviceops.
Tarrow updated the task description.

To: Tarrow
Cc: #serviceops, JMeybohm, Tarrow, akosiaris, Aklapper, Akuckartz, 
darthmon_wmde, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331