dcausse created this task.
dcausse added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION
  For roughly 7 hours (`2022-02-06T23:00:00` to `2022-02-07T06:20:00`) the streaming 
updater in `eqiad` stopped working properly, preventing edits from flowing to all 
the wdqs machines in eqiad.
  The update lag started to rise in eqiad and caused edits to be throttled during 
this period:
  
  F34944091: Capture d’écran du 2022-02-07 11-40-08.png (screenshot) 
<https://phabricator.wikimedia.org/F34944091>
  
  Investigations:
  
  - the streaming updater for WCQS went down from `2022-02-06T16:32:00` to 
`2022-02-06T23:00:00`
  - the streaming updater for WDQS went down from `2022-02-06T23:00:00` to 
`2022-02-07T06:20:00`
  - the total number of task slots dropped from 24 to 20 (4 task slots == 1 pod) 
between `2022-02-06T16:32:00` and `2022-02-07T06:20:00`, causing resource 
starvation and preventing both jobs from running at the same time 
(`flink_jobmanager_taskSlotsTotal{kubernetes_namespace="rdf-streaming-updater"}`)
  - kubernetes1014 (T301099 <https://phabricator.wikimedia.org/T301099>) appeared 
to have problems during this same period (`2022-02-06T16:32:00` to 
`2022-02-07T06:20:00`)
  - the deployment used by the updater had one pod 
(`1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a`) running on kubernetes1014
  - the flink session cluster regained its 24 slots after 
`1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a` came back (at `2022-02-07T08:07:00`); 
this pod then disappeared again in favor of another one and the service 
restarted successfully.
  - during the whole incident the k8s and flink metrics disagreed (see the 
Prometheus sketch after this list):
    - flink reported that it had lost 4 task managers (1 pod)
    - k8s consistently reported at least 6 pods 
(`count(container_memory_usage_bytes{namespace="rdf-streaming-updater", 
container="flink-session-cluster-main-taskmanager"})`)
  
  Questions:
  
  - why do the flink and k8s metrics disagree (number of active pods vs number 
of task managers)?
  - why was a new pod not created after kubernetes1014 went down (making 
`1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a` unavailable to the deployment)? (see the 
diagnostic sketch after this list)
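
  For the second question, a starting point would be to list the pods and the 
recent Pod events in the namespace around the time kubernetes1014 went down 
(scheduling failures, NotReady-related evictions, ...). Below is a minimal 
diagnostic sketch using the official kubernetes Python client, assuming 
kubeconfig access to the cluster; everything except the namespace is generic.

```lang=python
#!/usr/bin/env python3
"""List the pods and recent Pod events in the rdf-streaming-updater namespace,
to see whether the scheduler ever tried to replace the pod lost with
kubernetes1014.

Sketch only: assumes the official `kubernetes` client and a kubeconfig
pointing at the right cluster.
"""
from kubernetes import client, config

NAMESPACE = "rdf-streaming-updater"

config.load_kube_config()  # use the current kubeconfig context
v1 = client.CoreV1Api()

# Where is each pod of the namespace running, and in what state?
for pod in v1.list_namespaced_pod(NAMESPACE).items:
    print(f"{pod.metadata.name} uid={pod.metadata.uid} "
          f"node={pod.spec.node_name} phase={pod.status.phase}")

# Recent Pod events (FailedScheduling, NodeNotReady, Killing, ...).
events = v1.list_namespaced_event(NAMESPACE,
                                  field_selector="involvedObject.kind=Pod")
for ev in events.items:
    print(f"{ev.last_timestamp} {ev.involved_object.name} "
          f"{ev.reason}: {ev.message}")
```

  Note that k8s only retains events for a short time (about an hour by 
default), so this has to be run close to the incident; after that the same 
questions have to be answered from logs.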
  
  What could we have done better:
  
  - we could have routed wdqs traffic to codfw during the outage and avoided 
throttling edits

TASK DETAIL
  https://phabricator.wikimedia.org/T301147

To: dcausse
Cc: Aklapper, dcausse, MPhamWMF, CBogen, Namenlos314, Gq86, 
Lucas_Werkmeister_WMDE, EBjune, merbst, Jonas, Xmlizer, jkroll, Wikidata-bugs, 
Jdouglas, aude, Tobias1984, Manybubbles