[Wikidata-bugs] [Maniphest] T350784: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator

2024-04-09 Thread JMeybohm
JMeybohm updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T350784 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: bking, JMeybohm Cc: dcausse, JMeybohm, Aklapper, bking, Danny_Benjafield_WMDE, Isabelladantes1983

[Wikidata-bugs] [Maniphest] T362084: shellbox-constraints returning 500 on preg_match error

2024-04-09 Thread JMeybohm
JMeybohm added a comment. In T362084#9700057 <https://phabricator.wikimedia.org/T362084#9700057>, @Lucas_Werkmeister_WMDE wrote: > Can someone clarify what the problem here is? From WBQC’s perspective, it’s totally expected that some of these regex checks will fail (though the

[Wikidata-bugs] [Maniphest] T350784: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator

2023-11-30 Thread JMeybohm
JMeybohm updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T350784 To: JMeybohm Cc: dcausse, JMeybohm, Aklapper, bking, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, BTullis

[Wikidata-bugs] [Maniphest] T293063: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes

2023-02-02 Thread JMeybohm
JMeybohm added a comment. In T293063#8582600 <https://phabricator.wikimedia.org/T293063#8582600>, @dcausse wrote: > Hey, clarified this a bit, renamed it to "Hard depool/re-pool", yes in this method the jobs should start right after the helm deploy, the jar is

[Wikidata-bugs] [Maniphest] T293063: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes

2023-02-02 Thread JMeybohm
JMeybohm added a comment. Hey @dcausse, I'm reading this again because of the upcoming k8s 1.23 upgrade and was wondering: In "To restore:" section of "Alternate actions (not fully untested):" - do we need to start the job somehow as well, specifying w

[Wikidata-bugs] [Maniphest] T326409: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model

2023-02-02 Thread JMeybohm
JMeybohm added a project: serviceops-radar. Restricted Application added a project: wdwb-tech. TASK DETAIL https://phabricator.wikimedia.org/T326409 To: JMeybohm Cc: BTullis, JMeybohm, gmodena, Ottomata

[Wikidata-bugs] [Maniphest] T326409: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model

2023-02-02 Thread JMeybohm
JMeybohm updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T326409 To: JMeybohm Cc: BTullis, JMeybohm, gmodena, Ottomata, bking, Aklapper, dcausse, Themindcoder, Adamm71, Jersione

[Wikidata-bugs] [Maniphest] T293063: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes

2022-08-30 Thread JMeybohm
JMeybohm updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T293063 To: JMeybohm Cc: RKemper, Gehel, bking, JMeybohm, Jelto, Aklapper, jijiki, dcausse, Astuthiodit_1, AWesterinen

[Wikidata-bugs] [Maniphest] T301147: The WDQS streaming updater went unstable for several hours (2022-02-06T23:00:00 - 2022-02-07T06:20:00)

2022-03-31 Thread JMeybohm
JMeybohm added a comment. In T301147#7821813 <https://phabricator.wikimedia.org/T301147#7821813>, @dcausse wrote: > The additional PODs won't be used as a flink job does not automatically scale so it would be a pure waste of resources (2.5G of reserved mem per additional POD).

[Wikidata-bugs] [Maniphest] T301147: The WDQS streaming updater went unstable for several hours (2022-02-06T23:00:00 - 2022-02-07T06:20:00)

2022-03-31 Thread JMeybohm
JMeybohm added a comment. > To be discussed with service ops: > > - Investigate and address the reasons why after a node failure k8s did not fulfill its promise of making sure that the rdf-streaming-updater deployment have 6 working replicas The problem was more that the

[Wikidata-bugs] [Maniphest] T301147: The WDQS streaming updater went unstable for several hours (2022-02-06T23:00:00 - 2022-02-07T06:20:00)

2022-02-08 Thread JMeybohm
JMeybohm added a comment. In T301147#7689837 <https://phabricator.wikimedia.org/T301147#7689837>, @dcausse wrote: > @JMeybohm we're still investigating why the application did not properly recover while kubernetes1014 went down but if you have ideas on the two questions in t

[Wikidata-bugs] [Maniphest] T280485: Additional capacity on the k8s Flink cluster for WCQS updater

2021-11-17 Thread JMeybohm
JMeybohm added a comment. I'd opt for "reuse the same [flink] cluster" from the perspective that we treat this snowflaky-ish in the k8s clusters. So less flink-clusters means less snowflakes (at some point it does become a snowball, right?). TASK DETA

[Wikidata-bugs] [Maniphest] T293063: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes

2021-11-09 Thread JMeybohm
JMeybohm added subscribers: Jelto, JMeybohm. JMeybohm added a comment. @dcausse IIRC we said that "something in the areas of hours" would be considered a "short maintenance" and thus would not need any additional actions to be carried out, right? As pa

[Wikidata-bugs] [Maniphest] T287443: Flink jobmanager and taskmanager cannot talk to the k8s api server

2021-07-28 Thread JMeybohm
JMeybohm closed this task as "Resolved". JMeybohm added a comment. Thanks, closing then. TASK DETAIL https://phabricator.wikimedia.org/T287443 To: JMeybohm Cc: JMeybohm, dcausse, Aklapper

[Wikidata-bugs] [Maniphest] T287443: Flink jobmanager and taskmanager cannot talk to the k8s api server

2021-07-28 Thread JMeybohm
JMeybohm added a comment. That is because your application is reading default kubernetes environment variables which carry the ClusterIP of `kubernetes.default.svc.cluster.local` instead of its name. The ClusterIP we unfortunately don't have in the certificate on the actual servers
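
To illustrate the comment above, here is a minimal sketch (not taken from the task) of a client building the API server URL from the DNS name rather than from the injected environment variables. `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT` are the standard variables Kubernetes injects into Pods; the helper `api_server_url` and its `prefer_dns_name` parameter are hypothetical names used only for this example.

```python
import os

# DNS name of the API server service, as quoted in the comment above.
K8S_API_DNS_NAME = "kubernetes.default.svc.cluster.local"


def api_server_url(env=None, prefer_dns_name=True):
    """Hypothetical helper: build the API server base URL.

    With prefer_dns_name=True the host is the service DNS name, which a
    server certificate can list. With prefer_dns_name=False the host comes
    from KUBERNETES_SERVICE_HOST, which carries the ClusterIP and fails TLS
    verification when that IP is not in the certificate.
    """
    env = os.environ if env is None else env
    port = env.get("KUBERNETES_SERVICE_PORT", "443")
    if prefer_dns_name:
        host = K8S_API_DNS_NAME
    else:
        host = env.get("KUBERNETES_SERVICE_HOST", K8S_API_DNS_NAME)
    return f"https://{host}:{port}"
```

Whether the Flink client in question can be pointed at the name instead of the ClusterIP is not stated in the snippet; the sketch only shows the difference between the two host sources.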

[Wikidata-bugs] [Maniphest] T287443: Flink jobmanager and taskmanager cannot talk to the k8s api server

2021-07-27 Thread JMeybohm
JMeybohm claimed this task. JMeybohm added a comment. Looking into this. Problem is that we currently do not allow Pods to access the Kubernetes API servers (Egress rule is missing) and it's not super trivial to allow that in a transparent way (e.g. without having to declare the API

[Wikidata-bugs] [Maniphest] T285219: cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503

2021-07-19 Thread JMeybohm
JMeybohm added a subscriber: RLazarus. JMeybohm added a comment. Picking up from the IRC conversation yesterday @RLazarus figured that the response body looks like it is https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/master/errorpages/503.html At the time this issue

[Wikidata-bugs] [Maniphest] T273098: High Availability Flink

2021-04-14 Thread JMeybohm
JMeybohm added a comment. I do see that using the configmap election method is appealing as it is built in and does not require additional software to function. Unfortunately I was not able to understand (by briefly reading the docs) if this uses a separate configmap or the one

[Wikidata-bugs] [Maniphest] T276550: Missing alerts for Termbox staging and test services

2021-03-09 Thread JMeybohm
JMeybohm added a comment. It was more a matter of a day than a month (as we just upgraded the kubernetes version in staging). Also we don't enable monitoring for staging in general, but of course errors like that should be caught at deploy time. This can currently be done by running `helmfile

[Wikidata-bugs] [Maniphest] T264821: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes

2020-10-13 Thread JMeybohm
JMeybohm triaged this task as "Medium" priority. TASK DETAIL https://phabricator.wikimedia.org/T264821 To: JMeybohm Cc: Michael, RhinosF1, Joe, LSobanski, Addshore, Ladsgroup, RLazarus,

[Wikidata-bugs] [Maniphest] T260329: Figure what change caused the ongoing memleak on mw appservers

2020-08-14 Thread JMeybohm
JMeybohm added a comment. Looking at the values today it's pretty clear that mw1382 wins and mw1381 takes the second place. The overall memory usage looks like it's safe to leave it this way over the weekend. On Monday we should reboot the clusters again, with "cgroup.memory=n

[Wikidata-bugs] [Maniphest] T260329: Figure what change caused the ongoing memleak on mw appservers

2020-08-13 Thread JMeybohm
JMeybohm updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T260329 To: JMeybohm Cc: Ladsgroup, Tarrow, Addshore, CDanis, Aklapper, jijiki, ArielGlenn, RhinosF1, Joe, lmata

[Wikidata-bugs] [Maniphest] [Commented On] T255410: Termbox SSR connection terminated very often

2020-06-16 Thread JMeybohm
JMeybohm added a comment. @Michael thanks for writing this up! So, if it is safe to assume the MW -> termbox timeout is 3s I would suggest we configure the envoys accordingly by setting `tls.upstream_timeout: "3s"` in termbox values.yaml as well as `timeout: "3s"