dcausse created this task. dcausse added a project: Wikidata-Query-Service.
TASK DESCRIPTION

For 7 hours (`2022-02-06T23:00:00` to `2022-02-07T06:20:00`) the streaming updater in `eqiad` stopped working properly, preventing edits from flowing to the wdqs machines in eqiad. The lag rose in eqiad and caused edits to be throttled during this period:

F34944091: Capture d’écran du 2022-02-07 11-40-08.png <https://phabricator.wikimedia.org/F34944091>

Investigations:
- the streaming updater for WCQS went down from `2022-02-06T16:32:00` to `2022-02-06T23:00:00`
- the streaming updater for WDQS went down from `2022-02-06T23:00:00` to `2022-02-07T06:20:00`
- the total number of task slots dropped from 24 to 20 (4 tasks == 1 pod) between `2022-02-06T16:32:00` and `2022-02-07T06:20:00`, causing resource starvation and preventing both jobs from running at the same time (`flink_jobmanager_taskSlotsTotal{kubernetes_namespace="rdf-streaming-updater"}`)
- kubernetes1014 (T301099 <https://phabricator.wikimedia.org/T301099>) seemed to show problems during this same period (`2022-02-06T16:32:00` to `2022-02-07T06:20:00`)
- the deployment used by the updater ran one POD (`1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a`) on kubernetes1014
- the flink session cluster regained its 24 slots after `1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a` came back (at `2022-02-07T08:07:00`); this POD then disappeared again in favor of another one and the service restarted successfully
- during the whole incident the k8s metrics and the flink metrics disagreed (see the diagnostic sketches at the end of this description):
  - flink says that it lost 4 task managers (1 POD)
  - k8s always reports at least 6 PODs (`count(container_memory_usage_bytes{namespace="rdf-streaming-updater", container="flink-session-cluster-main-taskmanager"})`)

Questions:
- why do the flink and k8s metrics disagree (active PODs vs number of task managers)?
- why was a new POD not created after kubernetes1014 went down (making `1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a` unavailable to the deployment)?

What could we have done better:
- we could have routed wdqs traffic to codfw during the outage and avoided throttling edits

TASK DETAIL
https://phabricator.wikimedia.org/T301147
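As a starting point for the first question, a minimal sketch of how the two disagreeing counts could be pulled side by side from Prometheus, using the two queries quoted above. The Prometheus base URL is a placeholder, not the real WMF endpoint:

```python
import requests

PROMETHEUS = "http://prometheus.example.org"  # placeholder endpoint, adjust to taste


def instant_query(promql: str) -> float:
    """Run an instant PromQL query and return the first scalar value (0.0 if empty)."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


# Task slots as seen by the flink jobmanager (4 slots == 1 pod).
flink_slots = instant_query(
    'flink_jobmanager_taskSlotsTotal{kubernetes_namespace="rdf-streaming-updater"}'
)

# Taskmanager pods as seen by kubernetes (cadvisor container metrics).
k8s_pods = instant_query(
    'count(container_memory_usage_bytes{namespace="rdf-streaming-updater", '
    'container="flink-session-cluster-main-taskmanager"})'
)

# With 4 task slots per pod the two views should agree; during the incident
# flink reported 20 slots (5 pods) while k8s kept reporting at least 6 pods.
print(f"flink: {flink_slots:.0f} slots -> {flink_slots / 4:.0f} pods")
print(f"k8s:   {k8s_pods:.0f} taskmanager pods")
```

Running this periodically (or as a range query over the incident window) would make the divergence between the two sources directly visible instead of requiring two separate dashboards.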
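The metric view could also be cross-checked against the live state: the flink jobmanager REST API lists the task managers actually registered with the session cluster, independently of Prometheus scraping. A sketch, assuming the REST endpoint has been port-forwarded to localhost (the URL is an assumption):

```python
import requests

# Placeholder: in-cluster service URL or a `kubectl port-forward` to the jobmanager.
FLINK_REST = "http://localhost:8081"

resp = requests.get(f"{FLINK_REST}/taskmanagers")
resp.raise_for_status()
taskmanagers = resp.json()["taskmanagers"]

print(f"{len(taskmanagers)} task managers registered")
for tm in taskmanagers:
    # "id" and "slotsNumber" are part of the Flink REST response; a starved
    # session cluster would show fewer entries here than k8s shows pods.
    print(tm["id"], "slots:", tm["slotsNumber"])
```

Comparing this list with `kubectl get pods -n rdf-streaming-updater` would distinguish the two failure modes behind the second question: a pod that was never rescheduled, versus a pod that is running according to k8s but never re-registered with the jobmanager.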