Hello everyone,

A quick update on additional impact for the upcoming maintenance. The fully updated maintenance description is at the end of this email.

Short version:

The Maps infrastructure may experience some perturbation during this maintenance.

Impact:

Users:
- Case 1: The new bookworm-based codfw stack performs well, and service disruption should be minimal.
- Case 2: If the new codfw stack produces errors, we fall back to the old codfw stack, which will come with some OSM-data lag that we cannot yet measure.

Mitigation:
- Maps will be redeployed with the same priority as Toolhub to minimize downtime.

Rationale:

As part of the work to upgrade the Maps infrastructure <https://phabricator.wikimedia.org/T381565> and bring the kartotherian service to Wikikube, kartotherian is currently single-homed in eqiad Wikikube, using the old buster-based stack as a backend. The new bookworm-based stack in codfw is being brought up quickly, so we will use this maintenance as an opportunity to shift traffic to it (Case 1). In addition, we are warming up the old buster-based stack in codfw so we can fall back to it if issues arise (Case 2).

---

Object: Kubernetes upgrade to 1.31

Target: eqiad Wikikube cluster

Maintenance window: 2025-10-01 10:00 <https://zonestamp.toolforge.org/1759312800>-15:00 <https://zonestamp.toolforge.org/1759330800> UTC

Tracking task: Phabricator T405703, "Update wikikube eqiad to kubernetes 1.31" <https://phabricator.wikimedia.org/T405703>

Operational channel: IRC #wikimedia-sre <https://web.libera.chat/gamja/?nick=Guest#wikimedia-sre>; announcements will be made to IRC #wikimedia-operations <https://web.libera.chat/gamja/?nick=Guest#wikimedia-operations>

Operating team: SRE ServiceOps (contact IRC #wikimedia-serviceops <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>)

Impact:

Users:
- Toolhub will be down for the duration of the window.
- Maps may experience some perturbation during this maintenance.
- No user impact for other services.

Deployers:
- Deployments to the target cluster will be unavailable. This includes MediaWiki backports and deployments. DO NOT DEPLOY.
- The following deployment windows are cancelled:
  - Services: Citoid/Zotero 11:00 UTC <https://zonestamp.toolforge.org/1759316400>
  - UTC Afternoon Backport Window 13:00 UTC <https://zonestamp.toolforge.org/1759330800>
  - Wikifunctions Services UTC Afternoon 14:00 UTC <https://zonestamp.toolforge.org/1759327200>

Process:

All steps are handled by SRE ServiceOps.

- Maintenance start is announced on #wikimedia-operations and as a reply to this email chain
- All deployments are stopped
- SRE ServiceOps ensures all current versions of deployments can be safely deployed
- Maintenance begins and should take a couple of hours
- Maps is switched over to the new codfw stack, perturbations may start
- Toolhub downtime starts
- Possible Maps fallback to the old codfw stack
- Cluster is wiped and upgraded
- Maps and Toolhub are redeployed first to minimize downtime
- Maps is switched back to eqiad, perturbations end
- Toolhub downtime stops
- SRE ServiceOps redeploys all target cluster services
- Maintenance end is announced on #wikimedia-operations and as a reply to this email chain
- Deployments resume

Rationale:

The date was chosen for convenience: because of the data center switchover process <https://wikitech.wikimedia.org/wiki/Switch_Datacenter>, eqiad is currently fully depooled and receives almost no traffic. eqiad is scheduled to be repooled on 2025-10-02 <https://zonestamp.toolforge.org/1759417200>, which would complicate the upgrade. With eqiad already drained, we expect no visible user impact.

SRE ServiceOps will check that all services can be safely deployed before the maintenance, and will redeploy all services before marking the cluster as usable. Deployers are not required to redeploy their services unless SRE ServiceOps has asked them to do so.

During last week’s switchover <https://phabricator.wikimedia.org/T399891>, Toolhub remained in eqiad. This means that there will be an expected, unavoidable downtime of a few hours.
To minimize Toolhub’s downtime, we will prioritize its redeployment during the initialization phase.

As part of the work to upgrade the Maps infrastructure <https://phabricator.wikimedia.org/T381565> and bring the kartotherian service to Wikikube, kartotherian is currently single-homed in eqiad Wikikube, using the old buster-based stack as a backend. The new bookworm-based stack in codfw is being brought up quickly, so we will use this maintenance as an opportunity to shift traffic to it (Case 1). In addition, we are warming up the old buster-based stack in codfw so we can fall back to it if issues arise (Case 2).

Thank you for your understanding and support! If you have any questions regarding this process, please respond to this email, comment on Phabricator T405703, "Update wikikube eqiad to kubernetes 1.31" <https://phabricator.wikimedia.org/T405703>, or reach out directly to me (IRC nickname claime on #wikimedia-serviceops <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>).

On behalf of SRE ServiceOps,

On Mon, Sep 29, 2025 at 5:37 PM Clément Goubert <[email protected]> wrote:

> Hello everyone,
>
> Short version:
>
> We will be upgrading the eqiad Wikikube kubernetes
> <https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters#WikiKube>
> cluster to 1.31 on Wednesday 2025-10-01 starting at 10:00 UTC
> <https://zonestamp.toolforge.org/1759312800>, ending at 15:00 UTC
> <https://zonestamp.toolforge.org/1759330800>.
>
> Toolhub will be down during this maintenance.
>
> If you are deploying services to the eqiad Wikikube kubernetes cluster:
>
> - Deployments will be unavailable during the maintenance. DO NOT DEPLOY.
> - SRE will redeploy all services
> - SRE will announce the end of maintenance, at which point the cluster
>   will be usable again
>
> ---
>
> Object: Kubernetes upgrade to 1.31
>
> Target: eqiad Wikikube cluster
>
> Maintenance window: 2025-10-01 10:00
> <https://zonestamp.toolforge.org/1759312800>-15:00
> <https://zonestamp.toolforge.org/1759330800> UTC
>
> Tracking task: Phabricator at ⚓T405703 Update wikikube eqiad to
> kubernetes 1.31 <https://phabricator.wikimedia.org/T405703>
>
> Operational channel: IRC #wikimedia-sre
> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-sre>, announcements
> will be made to IRC #wikimedia-operations
> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-operations>
>
> Operating team: SRE ServiceOps (contact IRC #wikimedia-serviceops
> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>)
>
> Impact:
>
> Users:
>
> - Toolhub will be down for the duration of the window.
> - No user impact for other services.
>
> Deployers:
>
> - Deployments to the target cluster will be unavailable. This includes
>   MediaWiki backports and deployments. DO NOT DEPLOY.
> - The following deployment windows are cancelled:
>   - Services: Citoid/Zotero 11:00 UTC
>     <https://zonestamp.toolforge.org/1759316400>
>   - UTC Afternoon Backport Window 13:00 UTC
>     <https://zonestamp.toolforge.org/1759330800>
>   - Wikifunctions Services UTC Afternoon 14:00 UTC
>     <https://zonestamp.toolforge.org/1759327200>
>
> Process:
>
> All steps handled by SRE ServiceOps
>
> - Maintenance start is announced on #wikimedia-operations and as reply
>   to this email chain
> - All deployments are stopped
> - SRE ServiceOps ensures all current versions of deployments can be
>   safely deployed
> - Maintenance begins and should take a couple of hours
> - Toolhub downtime starts
> - Cluster is wiped and upgraded
> - Toolhub is redeployed first to minimize downtime
> - Toolhub downtime stops
> - SRE ServiceOps redeploys all target cluster services
> - Maintenance end is announced on #wikimedia-operations and as reply to
>   this email chain
> - Deployments resume
>
> Rationale:
>
> The date was chosen for convenience as due to the data center switchover
> process <https://wikitech.wikimedia.org/wiki/Switch_Datacenter>, eqiad is
> currently fully depooled, receiving almost no traffic. eqiad is scheduled
> to be repooled on 2025-10-02 <https://zonestamp.toolforge.org/1759417200>,
> which would complicate the upgrade. With eqiad already drained, we expect
> no visible user impact.
>
> SRE ServiceOps will be checking that all services can be safely deployed
> before the maintenance, and will be redeploying all services before marking
> the cluster as usable. Deployers are not required to re-deploy their
> services, unless they have been informed to do so by SRE ServiceOps.
>
> During last week’s switchover <https://phabricator.wikimedia.org/T399891>,
> Toolhub remained in eqiad. This means that there will be an expected
> unavoidable small downtime of a few hours.
> To minimize Toolhub’s downtime,
> we will prioritize its redeployment during the initialization phase.
>
> Thank you for your understanding and support! If you have any questions
> regarding this process, please respond to this email, comment on
> Phabricator at ⚓T405703 Update wikikube eqiad to kubernetes 1.31
> <https://phabricator.wikimedia.org/T405703>, or reach out directly to me
> (IRC nickname claime on #wikimedia-serviceops
> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>).
>
> On behalf of SRE ServiceOps,
>
> --
> Clément 'claime' Goubert (they/them)
> Senior SRE
> Wikimedia Foundation

--
Clément 'claime' Goubert (they/them)
Senior SRE
Wikimedia Foundation
