Starting maintenance.

On Wed, Oct 1, 2025 at 11:54 AM Clément Goubert <[email protected]>
wrote:

> Hello everyone,
>
> An update on the status of Maps for the upcoming upgrade.
>
>
> *Short version:*
>
> *Maps will serve some stale map tiles for the next few hours.*
>
> *Rationale:*
>
> The OSM map tile cache is still being refreshed: there are a lot of
> elements to fetch, and we couldn't finish before the upgrade. The refresh
> will continue during the migration, so the number of stale tiles served
> will go down as time passes. We decided this was the best of the three
> options available to us. The other two were depooling the service
> entirely, which would leave Maps unavailable for the duration of the
> maintenance, and postponing the upgrade, which would snowball into
> pushing back the eqiad repool.
>
> ---
>
> Object: Kubernetes upgrade to 1.31
>
> Target: eqiad Wikikube cluster
>
> Maintenance window: 2025-10-01 10:00
> <https://zonestamp.toolforge.org/1759312800>-15:00
> <https://zonestamp.toolforge.org/1759330800> UTC
>
> Tracking task: Phabricator at ⚓T405703 Update wikikube eqiad to
> kubernetes 1.31 <https://phabricator.wikimedia.org/T405703>
>
> Operational channel: IRC #wikimedia-sre
> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-sre>, announcements
> will be made to IRC #wikimedia-operations
> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-operations>
>
> Operating team: SRE ServiceOps (contact IRC #wikimedia-serviceops
> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>)
>
> Impact:
>
> Users:
>
>    - Toolhub will be down for the duration of the window.
>    - Maps may experience some perturbation during this maintenance, most
>      probably serving stale map tiles while the cache is being refreshed.
>    - No user impact for other services.
>
> Deployers:
>
>    - Deployments to the target cluster will be unavailable. This includes
>      MediaWiki backports and deployments. DO NOT DEPLOY.
>    - The following deployment windows are cancelled:
>       - Services: Citoid/Zotero 11:00 UTC
>         <https://zonestamp.toolforge.org/1759316400>
>       - UTC Afternoon Backport Window 13:00 UTC
>         <https://zonestamp.toolforge.org/1759330800>
>       - Wikifunctions Services UTC Afternoon 14:00 UTC
>         <https://zonestamp.toolforge.org/1759327200>
>
> Process:
>
> All steps handled by SRE ServiceOps
>
>    - Maintenance start is announced on #wikimedia-operations and as a reply
>      to this email chain
>    - All deployments are stopped
>    - SRE ServiceOps ensures all current versions of deployments can be
>      safely deployed
>    - Maintenance begins and should take a couple of hours
>    - Maps is switched over to the new codfw stack, perturbations may start
>    - Toolhub downtime starts
>    - Possible Maps fallback to the old codfw stack
>    - Cluster is wiped and upgraded
>    - Maps and Toolhub are redeployed first to minimize downtime
>    - Maps is switched back to eqiad, perturbations end
>    - Toolhub downtime stops
>    - SRE ServiceOps redeploys all target cluster services
>    - Maintenance end is announced on #wikimedia-operations and as a reply to
>      this email chain
>    - Deployments resume
>
> Rationale:
>
> The date was chosen for convenience: because of the data center switchover
> process <https://wikitech.wikimedia.org/wiki/Switch_Datacenter>, eqiad is
> currently fully depooled and receiving almost no traffic. eqiad is scheduled
> to be repooled on 2025-10-02 <https://zonestamp.toolforge.org/1759417200>,
> after which the upgrade would be more complicated. With eqiad already
> drained, we expect no visible user impact.
>
> SRE ServiceOps will check that all services can be safely deployed
> before the maintenance, and will redeploy all services before marking
> the cluster as usable. Deployers are not required to redeploy their
> services unless SRE ServiceOps has asked them to.
>
> During last week’s switchover <https://phabricator.wikimedia.org/T399891>,
> Toolhub remained in eqiad. This means a short, unavoidable downtime of a
> few hours is expected. To minimize Toolhub’s downtime, we will prioritize
> its redeployment during the initialization phase.
>
> As part of the work to upgrade the Maps infrastructure
> <https://phabricator.wikimedia.org/T381565> and bring the kartotherian
> service to Wikikube, kartotherian is currently single-homed in eqiad
> Wikikube, using the old buster-based stack as a backend. The new
> bookworm-based stack in codfw is being brought up quickly, so we will use
> this maintenance as an opportunity to shift traffic to it (Case 1). In
> addition, we are warming up the old buster-based stack in codfw so we
> can fall back to it if issues arise (Case 2).
>
> As of 15 minutes before the maintenance, the OSM map tile cache is still
> being refreshed: there are a lot of elements to fetch, and we couldn't
> finish before the upgrade. The refresh will continue during the migration,
> so the number of stale tiles served will go down as time passes. We decided
> this was the best of the three options available to us. The other two were
> depooling the service entirely, which would leave Maps unavailable for the
> duration of the maintenance, and postponing the upgrade, which would
> snowball into pushing back the eqiad repool.
>
> Thank you for your understanding and support! If you have any questions
> regarding this process, please respond to this email, comment on
> Phabricator at ⚓T405703 Update wikikube eqiad to kubernetes 1.31
> <https://phabricator.wikimedia.org/T405703>, or reach out directly to me
> (IRC nickname claime on #wikimedia-serviceops
> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>).
>
> On behalf of SRE ServiceOps,
>
>
> On Tue, Sep 30, 2025 at 4:54 PM Clément Goubert <[email protected]>
> wrote:
>
>> Hello everyone,
>>
>> A quick update on additional impact for the upcoming maintenance. Fully
>> updated maintenance description at the end of this email.
>>
>> Short version:
>>
>> The Maps infrastructure may experience some perturbation during this
>> maintenance.
>>
>> Impact:
>>
>> Users:
>>
>>    - Case 1: The new bookworm-based codfw stack performs well and service
>>      disruption should be minimal
>>    - Case 2: If errors are experienced with the new codfw stack, the
>>      fallback to the old codfw stack will come with some OSM-data lag, as
>>      yet unmeasurable
>>
>> Mitigation:
>>
>>    - Maps will be redeployed with the same priority as Toolhub to minimize
>>      downtime.
>>
>> Rationale:
>>
>> As part of the work to upgrade the Maps infrastructure
>> <https://phabricator.wikimedia.org/T381565> and bring the kartotherian
>> service to Wikikube, kartotherian is currently single-homed in eqiad
>> Wikikube, using the old buster-based stack as a backend.
>>
>> The new bookworm-based stack in codfw is being brought up quickly, so we
>> will use this maintenance as an opportunity to shift traffic to it (case
>> 1). In addition, we are also warming up the old buster-based stack in codfw
>> so we can fall back to it in case issues arise (case 2).
>>
>> ---
>>
>> Object: Kubernetes upgrade to 1.31
>>
>> Target: eqiad Wikikube cluster
>>
>> Maintenance window: 2025-10-01 10:00
>> <https://zonestamp.toolforge.org/1759312800>-15:00
>> <https://zonestamp.toolforge.org/1759330800> UTC
>>
>> Tracking task: Phabricator at ⚓T405703 Update wikikube eqiad to
>> kubernetes 1.31 <https://phabricator.wikimedia.org/T405703>
>>
>> Operational channel: IRC #wikimedia-sre
>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-sre>, announcements
>> will be made to IRC #wikimedia-operations
>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-operations>
>>
>> Operating team: SRE ServiceOps (contact IRC #wikimedia-serviceops
>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>)
>>
>> Impact:
>>
>> Users:
>>
>>    - Toolhub will be down for the duration of the window.
>>    - Maps may experience some perturbation during this maintenance.
>>    - No user impact for other services.
>>
>> Deployers:
>>
>>    - Deployments to the target cluster will be unavailable. This includes
>>      MediaWiki backports and deployments. DO NOT DEPLOY.
>>    - The following deployment windows are cancelled:
>>       - Services: Citoid/Zotero 11:00 UTC
>>         <https://zonestamp.toolforge.org/1759316400>
>>       - UTC Afternoon Backport Window 13:00 UTC
>>         <https://zonestamp.toolforge.org/1759330800>
>>       - Wikifunctions Services UTC Afternoon 14:00 UTC
>>         <https://zonestamp.toolforge.org/1759327200>
>>
>> Process:
>>
>> All steps handled by SRE ServiceOps
>>
>>    - Maintenance start is announced on #wikimedia-operations and as a reply
>>      to this email chain
>>    - All deployments are stopped
>>    - SRE ServiceOps ensures all current versions of deployments can be
>>      safely deployed
>>    - Maintenance begins and should take a couple of hours
>>    - Maps is switched over to the new codfw stack, perturbations may start
>>    - Toolhub downtime starts
>>    - Possible Maps fallback to the old codfw stack
>>    - Cluster is wiped and upgraded
>>    - Maps and Toolhub are redeployed first to minimize downtime
>>    - Maps is switched back to eqiad, perturbations end
>>    - Toolhub downtime stops
>>    - SRE ServiceOps redeploys all target cluster services
>>    - Maintenance end is announced on #wikimedia-operations and as a reply to
>>      this email chain
>>    - Deployments resume
>>
>> Rationale:
>>
>> The date was chosen for convenience: because of the data center switchover
>> process <https://wikitech.wikimedia.org/wiki/Switch_Datacenter>, eqiad
>> is currently fully depooled and receiving almost no traffic. eqiad is
>> scheduled to be repooled on 2025-10-02
>> <https://zonestamp.toolforge.org/1759417200>, after which the upgrade
>> would be more complicated. With eqiad already drained, we expect no
>> visible user impact.
>>
>> SRE ServiceOps will check that all services can be safely deployed
>> before the maintenance, and will redeploy all services before marking
>> the cluster as usable. Deployers are not required to redeploy their
>> services unless SRE ServiceOps has asked them to.
>>
>> During last week’s switchover <https://phabricator.wikimedia.org/T399891>,
>> Toolhub remained in eqiad. This means a short, unavoidable downtime of a
>> few hours is expected. To minimize Toolhub’s downtime, we will prioritize
>> its redeployment during the initialization phase.
>>
>> As part of the work to upgrade the Maps infrastructure
>> <https://phabricator.wikimedia.org/T381565> and bring the kartotherian
>> service to Wikikube, kartotherian is currently single-homed in eqiad
>> Wikikube, using the old buster-based stack as a backend. The new
>> bookworm-based stack in codfw is being brought up quickly, so we will use
>> this maintenance as an opportunity to shift traffic to it (Case 1). In
>> addition, we are also warming up the old buster-based stack in codfw so we
>> can fall back to it in case issues arise (Case 2).
>>
>> Thank you for your understanding and support! If you have any questions
>> regarding this process, please respond to this email, comment on
>> Phabricator at ⚓T405703 Update wikikube eqiad to kubernetes 1.31
>> <https://phabricator.wikimedia.org/T405703>, or reach out directly to me
>> (IRC nickname claime on #wikimedia-serviceops
>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>).
>>
>> On behalf of SRE ServiceOps,
>>
>> On Mon, Sep 29, 2025 at 5:37 PM Clément Goubert <[email protected]>
>> wrote:
>>
>>> Hello everyone,
>>>
>>> Short version:
>>>
>>> We will be upgrading the eqiad Wikikube Kubernetes
>>> <https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters#WikiKube>
>>> cluster to 1.31 on Wednesday 2025-10-01 starting at 10:00 UTC
>>> <https://zonestamp.toolforge.org/1759312800>, ending at 15:00 UTC
>>> <https://zonestamp.toolforge.org/1759330800>.
>>>
>>> Toolhub will be down during this maintenance.
>>>
>>> If you are deploying services to the eqiad Wikikube Kubernetes cluster:
>>>
>>>    - Deployments will be unavailable during the maintenance. DO NOT
>>>      DEPLOY.
>>>    - SRE will redeploy all services
>>>    - SRE will announce the end of maintenance, at which point the cluster
>>>      will be usable again
>>>
>>> ---
>>>
>>> Object: Kubernetes upgrade to 1.31
>>>
>>> Target: eqiad Wikikube cluster
>>>
>>> Maintenance window: 2025-10-01 10:00
>>> <https://zonestamp.toolforge.org/1759312800>-15:00
>>> <https://zonestamp.toolforge.org/1759330800> UTC
>>>
>>> Tracking task: Phabricator at ⚓T405703 Update wikikube eqiad to
>>> kubernetes 1.31 <https://phabricator.wikimedia.org/T405703>
>>>
>>> Operational channel: IRC #wikimedia-sre
>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-sre>,
>>> announcements will be made to IRC #wikimedia-operations
>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-operations>
>>>
>>> Operating team: SRE ServiceOps (contact IRC #wikimedia-serviceops
>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>)
>>>
>>> Impact:
>>>
>>> Users:
>>>
>>>    - Toolhub will be down for the duration of the window.
>>>    - No user impact for other services.
>>>
>>> Deployers:
>>>
>>>    - Deployments to the target cluster will be unavailable. This includes
>>>      MediaWiki backports and deployments. DO NOT DEPLOY.
>>>    - The following deployment windows are cancelled:
>>>       - Services: Citoid/Zotero 11:00 UTC
>>>         <https://zonestamp.toolforge.org/1759316400>
>>>       - UTC Afternoon Backport Window 13:00 UTC
>>>         <https://zonestamp.toolforge.org/1759330800>
>>>       - Wikifunctions Services UTC Afternoon 14:00 UTC
>>>         <https://zonestamp.toolforge.org/1759327200>
>>>
>>> Process:
>>>
>>> All steps handled by SRE ServiceOps
>>>
>>>    - Maintenance start is announced on #wikimedia-operations and as a
>>>      reply to this email chain
>>>    - All deployments are stopped
>>>    - SRE ServiceOps ensures all current versions of deployments can be
>>>      safely deployed
>>>    - Maintenance begins and should take a couple of hours
>>>    - Toolhub downtime starts
>>>    - Cluster is wiped and upgraded
>>>    - Toolhub is redeployed first to minimize downtime
>>>    - Toolhub downtime stops
>>>    - SRE ServiceOps redeploys all target cluster services
>>>    - Maintenance end is announced on #wikimedia-operations and as a
>>>      reply to this email chain
>>>    - Deployments resume
>>>
>>> Rationale:
>>>
>>> The date was chosen for convenience: because of the data center
>>> switchover process
>>> <https://wikitech.wikimedia.org/wiki/Switch_Datacenter>, eqiad is
>>> currently fully depooled and receiving almost no traffic. eqiad is
>>> scheduled to be repooled on 2025-10-02
>>> <https://zonestamp.toolforge.org/1759417200>, after which the upgrade
>>> would be more complicated. With eqiad already drained, we expect no
>>> visible user impact.
>>>
>>> SRE ServiceOps will check that all services can be safely deployed
>>> before the maintenance, and will redeploy all services before marking
>>> the cluster as usable. Deployers are not required to redeploy their
>>> services unless SRE ServiceOps has asked them to.
>>>
>>> During last week’s switchover
>>> <https://phabricator.wikimedia.org/T399891>, Toolhub remained in eqiad.
>>> This means a short, unavoidable downtime of a few hours is expected. To
>>> minimize Toolhub’s downtime, we will prioritize its redeployment during
>>> the initialization phase.
>>>
>>>
>>>
>>> Thank you for your understanding and support! If you have any questions
>>> regarding this process, please respond to this email, comment on
>>> Phabricator at ⚓T405703 Update wikikube eqiad to kubernetes 1.31
>>> <https://phabricator.wikimedia.org/T405703>, or reach out directly to
>>> me (IRC nickname claime on #wikimedia-serviceops
>>> <https://web.libera.chat/gamja/?nick=Guest#wikimedia-serviceops>).
>>>
>>> On behalf of SRE ServiceOps,
>>>
>>> --
>>> Clément 'claime' Goubert (they/them)
>>> Senior SRE
>>> Wikimedia Foundation
>>>
>>
>>
>> --
>> Clément 'claime' Goubert (they/them)
>> Senior SRE
>> Wikimedia Foundation
>>
>
>
> --
> Clément 'claime' Goubert (they/them)
> Senior SRE
> Wikimedia Foundation
>


-- 
Clément 'claime' Goubert (they/them)
Senior SRE
Wikimedia Foundation
_______________________________________________
Wikitech-l mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
