[Cloud] [Cloud-announce] Downtime for select VMs next week, 2017-10-04

2017-09-29 Thread Andrew Bogott
In order to rebuild a server of questionable stability, I'm going to move the following instances on Wednesday: [instance table: Name | Tenant ID | Status]

Re: [Cloud] [Cloud-announce] Downtime for select VMs next week, 2017-10-04

2017-10-04 Thread Andrew Bogott
Reminder: I'm going to start migrating these hosts shortly. On 9/29/17 3:57 PM, Andrew Bogott wrote: In order to rebuild a server of questionable stability, I'm going to move the following instances on Wednesday:

Re: [Cloud] [Cloud-announce] Downtime for select VMs next week, 2017-10-04 (finished)

2017-10-04 Thread Andrew Bogott
This is done now. On 10/4/17 8:15 AM, Andrew Bogott wrote: Reminder: I'm going to start migrating these hosts shortly. On 9/29/17 3:57 PM, Andrew Bogott wrote: In order to rebuild a server of questionable stability, I'm going to move the following instances on

[Cloud] [Cloud-announce] Do you need Ubuntu Trusty for new VMs?

2017-10-10 Thread Andrew Bogott
Tool-forge users can ignore this email; it only concerns VPS project owners. Long ago, the Wikimedia Operations team made the decision to phase out use of Ubuntu servers in favor of Debian. It's a long, slow process that is still ongoing, but in production Trusty is running on an eve

[Cloud] [Cloud-announce] Ubuntu trusty now deprecated for new WMCS instances

2017-11-20 Thread Andrew Bogott
As discussed previously on this list [1] and on Phabricator [2], I've just removed the Ubuntu Trusty image as a default option when creating new VMs. This is part of a long-term foundation-wide process to standardize on Debian as the distribution of choice. Existing Trusty VMs are unaffected b

Re: [Cloud] Project specific standalone puppetmasters broken and the fix.

2017-11-29 Thread Andrew Bogott
On 11/29/17 1:31 PM, Chase Pettet wrote: A series of upgrades and changes have left instances with 'role::puppetmaster::standalone' applied in a broken state. This is unfortunate because Puppet is unable to fix itself. There is a small manual update required. I believe that I've now fixed all

[Cloud] [Cloud-announce] Partial toolforge outage ongoing

2017-12-12 Thread Andrew Bogott
Hello all, Some tools running on the Toolforge Kubernetes cluster are currently suffering from network failures. It's not yet fully diagnosed, although we have some ideas as to how to at least reduce the impact. The tracking bug is https://phabricator.wikimedia.org/T182722. We'll send anot

Re: [Cloud] [Cloud-announce] Partial toolforge outage ongoing -- resolved

2017-12-12 Thread Andrew Bogott
Toolforge services should be back to normal now. The problem is not yet fully understood, but details will trickle in on the tracking task, below. On 12/12/17 6:08 PM, Andrew Bogott wrote: Hello all, Some tools running on the Toolforge Kubernetes cluster are currently suffering from

[Cloud] [Cloud-announce] Maintenance reboots upcoming

2018-01-04 Thread Andrew Bogott
Sometime soon (probably in the next day or two) we will be applying kernel patches to all VMs and physical hosts in WMCS. This is to address an urgent security issue[1] , so we'll be skipping the traditional 7-day warning period -- basically as soon as proper fixes are available we'll start pat

Re: [Cloud] [Cloud-announce] Maintenance reboots upcoming (some of them TODAY)

2018-01-11 Thread Andrew Bogott
xy: yandex-proxy01 On 1/4/18 9:28 AM, Andrew Bogott wrote: Sometime soon (probably in the next day or two) we will be applying kernel patches to all VMs and physical hosts in WMCS. This is to address an urgent security issue[1] , so we'll be skipping the traditional 7-day warning period

Re: [Cloud] [Cloud-announce] Maintenance reboots upcoming -- Today's are finished but more coming next week

2018-01-11 Thread Andrew Bogott
ndrew On 1/11/18 1:02 PM, Andrew Bogott wrote: In a few minutes I'm going to start the first round of reboots.  We're going to do a subset of the cloud and then make sure there are no bad effects before doing the remainder on Monday. The following VMs will be upgraded and reboot

[Cloud] [Cloud-announce] Maintenance reboots TODAY

2018-01-16 Thread Andrew Bogott
make sure your jobs are still running after windows like this.  The list of VMs from last week (attached below) are already good to go so they should be unaffected today. -Andrew On 1/11/18 3:15 PM, Andrew Bogott wrote: Today's round of reboots is now finished -- the hosts rebooted are

Re: [Cloud] [Cloud-announce] Maintenance reboots -- finished

2018-01-16 Thread Andrew Bogott
The reboots are now done and everything is upgraded.  So far things seem back to normal, but visit us in #wikimedia-cloud if you find things amiss. -Andrew (+ WMCS team) On 1/16/18 8:57 AM, Andrew Bogott wrote: Good morning! The canary reboots last week went well, so we'll be upgradin

[Cloud] Brief Wikitech and Horizon outage upcoming

2018-01-17 Thread Andrew Bogott
As part of a security upgrade, I'll be rebooting the systems that host Wikitech and Horizon in about two hours, at 14:00 PST (16:00 CST). Those websites will be briefly unavailable, as will be the Nova api.  This last will cause a brief interruption to the WMF Continuous Integration system.  T

Re: [Cloud] Brief Wikitech and Horizon outage upcoming -- DONE

2018-01-17 Thread Andrew Bogott
On 1/17/18 2:19 PM, Andrew Bogott wrote: As part of a security upgrade, I'll be rebooting the systems that host Wikitech and Horizon in about two hours, at 14:00 PST (16:00 CST). These reboots are done and everything is back up.  Sorry for any inconvenience caused! -Andrew Those web

Re: [Cloud] [Cloud-announce] Cloud VPS single hypervisor failure and (some) down instances

2018-02-14 Thread Andrew Bogott
On 2/14/18 6:58 AM, Chase Pettet wrote: We lost a KVM host at around 7:20 UTC.  Because we use local storage for instances there are a number of them that are down.  Toolforge suffered a few losses but it seems to have been few enough that GridEngine and Kubernetes users are unaffected at thi

Re: [Cloud] [Cloud-announce] Cloud VPS single hypervisor failure and (some) down instances (possibly resolved)

2018-02-14 Thread Andrew Bogott
ry for the downtime! -Andrew + the WMCS team On 2/14/18 8:29 AM, Andrew Bogott wrote: On 2/14/18 6:58 AM, Chase Pettet wrote: We lost a KVM host at around 7:20 UTC.  Because we use local storage for instances there are a number of them that are down.  Toolforge suffered a few losses but it seems to

[Cloud] [Cloud-announce] Wikitech move this Friday, 2018-03-09 at 16:00 UTC

2018-03-07 Thread Andrew Bogott
On Friday morning my time (10:00 CST, 8:00 PST, 16:00 UTC) I'll be switching the dns record for wikitech.wikimedia.org to point to a new server.  This change should be largely invisible to users, but there are a few things to be ready for: - Most importantly, YOU WILL BE LOGGED OUT of Wikitech

Re: [Cloud] [Cloud-announce] Wikitech move this Friday, 2018-03-09 at 16:00 UTC

2018-03-09 Thread Andrew Bogott
Reminder:  This is happening in about an hour. On 3/7/18 4:14 PM, Andrew Bogott wrote: On Friday morning my time (10:00 CST, 8:00 PST, 16:00 UTC) I'll be switching the dns record for wikitech.wikimedia.org to point to a new server.  This change should be largely invisible to users, but

Re: [Cloud] [Cloud-announce] Wikitech move this Friday, 2018-03-09 at 16:00 UTC

2018-03-09 Thread Andrew Bogott
This is done.  Quick tests suggest that everything is working fine, but don't hesitate to contact me if you see any strange behavior. -Andrew On 3/9/18 8:57 AM, Andrew Bogott wrote: Reminder: This is happening in about an hour. On 3/7/18 4:14 PM, Andrew Bogott wrote: On Friday morni

[Cloud] Updates to Horizon and Toolsadmin coming up next week

2018-03-09 Thread Andrew Bogott
tl;dr: Starting on Wednesday, the Horizon UI is going to look a bit different. -- On Wednesday next week I'm going to switch Horizon and Toolsadmin traffic away from their current physical host and over to new hardware.  The change to Toolsadmin will be largely invisible, but the Horizon swi

[Cloud] [Cloud-announce] Brief service interruption tomorrow 2018-03-28 at 15:00 UTC

2018-03-27 Thread Andrew Bogott
About 24 hours from now we're going to reboot a couple of servers[1] in the cloud infrastructure to apply security updates. Few WMCS users (and, in particular, no tools users) should notice any interruption.  Nonetheless, a few services will be down: - New instance creation will fail - CI t

Re: [Cloud] [Cloud-announce] Brief service interruption tomorrow 2018-03-28 at 15:00 UTC

2018-03-28 Thread Andrew Bogott
All done! On 3/27/18 9:34 AM, Andrew Bogott wrote: About 24 hours from now we're going to reboot a couple of servers[1] in the cloud infrastructure to apply security updates. Few WMCS users (and, in particular, no tools users) should notice any interruption.  Nonetheless, a few services

[Cloud] [Cloud-announce] Horizon and CI downtime Friday, 2018-04-13

2018-04-05 Thread Andrew Bogott
    Next Friday we'll be upgrading our OpenStack cluster.  The upgrade should not interrupt any existing tools or instances, but during the upgrade it will be impossible to create, delete, or modify WMCS VMs.     I'll start the process at around 02:00 UTC (7AM PDT).  The complete upgrade may t

Re: [Cloud] [Cloud-announce] Horizon and CI downtime Friday, 2018-04-13 (time correction)

2018-04-05 Thread Andrew Bogott
On 4/5/18 1:17 PM, Andrew Bogott wrote:     Next Friday we'll be upgrading our OpenStack cluster.  The upgrade should not interrupt any existing tools or instances, but during the upgrade it will be impossible to create, delete, or modify WMCS VMs.     I'll start the process at ar

[Cloud] [Cloud-announce] Horizon and CI downtime Tomorrow, 2018-04-13

2018-04-12 Thread Andrew Bogott
    Next Friday we'll be upgrading our OpenStack cluster.  The upgrade should not interrupt any existing tools or instances, but during the upgrade it will be impossible to create, delete, or modify WMCS VMs.     I'll start the process at around 14:00 UTC (7AM PDT).  The complete upgrade may t

Re: [Cloud] [Cloud-announce] Horizon and CI downtime Tomorrow, 2018-04-13

2018-04-13 Thread Andrew Bogott
Reminder:  This is starting in a few minutes. -A On 4/12/18 8:58 AM, Andrew Bogott wrote:     Next Friday we'll be upgrading our OpenStack cluster.  The upgrade should not interrupt any existing tools or instances, but during the upgrade it will be impossible to create, delete, or modify

Re: [Cloud] [Cloud-announce] Horizon and CI downtime Tomorrow, 2018-04-13 -- DONE

2018-04-13 Thread Andrew Bogott
The upgrades are done, and Horizon and CI are re-enabled.  Please let me know if you find any new problems. -Andrew On 4/12/18 8:58 AM, Andrew Bogott wrote:     Next Friday we'll be upgrading our OpenStack cluster.  The upgrade should not interrupt any existing tools or instances, but d

[Cloud] [Cloud-announce] Upcoming WMCS network outages: Tuesday May 15th

2018-05-02 Thread Andrew Bogott
As part of some long-deferred routine maintenance, we need to update (and, in one case, physically move) the network servers that handle all traffic between WMCS instances.  During each change, all WMCS network traffic (including network access to all tools and VMs) will be interrupted for seve

Re: [Cloud] [Cloud-announce] Upcoming WMCS network outages: Tuesday May 15th

2018-05-14 Thread Andrew Bogott
Reminder:  this outage is happening tomorrow. On 5/2/18 10:22 AM, Andrew Bogott wrote: As part of some long-deferred routine maintenance, we need to update (and, in one case, physically move) the network servers that handle all traffic between WMCS instances.  During each change, all WMCS

Re: [Cloud] [Cloud-announce] Upcoming WMCS network outages: Tuesday May 15th

2018-05-15 Thread Andrew Bogott
The first of these outages is coming up in a few minutes. On 5/14/18 12:02 PM, Andrew Bogott wrote: Reminder:  this outage is happening tomorrow. On 5/2/18 10:22 AM, Andrew Bogott wrote: As part of some long-deferred routine maintenance, we need to update (and, in one case, physically move

Re: [Cloud] [Cloud-announce] Upcoming WMCS network outages: Tuesday May 15th

2018-05-15 Thread Andrew Bogott
The first of these tasks is done and the network is back up and running.  The outage lasted a bit less than 10 minutes. There will be another similar outage in a few hours. -Andrew On 5/2/18 10:22 AM, Andrew Bogott wrote: As part of some long-deferred routine maintenance, we need to update

Re: [Cloud] [Cloud-announce] Upcoming WMCS network outages: Tuesday May 15th

2018-05-15 Thread Andrew Bogott
Things are back up and running for the moment.  The last switch-over went poorly so we haven't actually reached our goals yet; there may be another interruption yet coming up. -A On 5/15/18 8:33 AM, Andrew Bogott wrote: The first of these tasks is done and the network is back up and ru

Re: [Cloud] [Cloud-announce] Upcoming WMCS network outages: Tuesday May 15th

2018-05-15 Thread Andrew Bogott
de as much warning about that as I can.  It's unlikely to be today, in any case. Sorry for any inconvenience caused! -Andrew On 5/15/18 12:04 PM, Andrew Bogott wrote: Things are back up and running for the moment.  The last switch-over went poorly so we haven't actually reached our

Re: [Cloud] [Cloud-announce] Upcoming WMCS network outages: Tuesday May 15th

2018-05-15 Thread Andrew Bogott
The next step in this is scheduled for tomorrow at 15:00 UTC, 8:00AM in SF.  Again, all network service will be interrupted for 5-10 minutes. Sorry for all the emails!  With luck there will only be one more. -Andrew On 5/15/18 12:24 PM, Andrew Bogott wrote: We're leaving things in th

Re: [Cloud] [Cloud-announce] Upcoming WMCS network outages: Tuesday May 15th (DONE)

2018-05-16 Thread Andrew Bogott
We had a couple of minutes of downtime just now, and everything is back up.  This went a lot better today; this should be the last of these network interruptions for a while. -Andrew On 5/15/18 3:31 PM, Andrew Bogott wrote: The next step in this is scheduled for tomorrow at 15:00 UTC, 8

[Cloud] [Cloud-announce] Reduced support this week and next

2018-05-16 Thread Andrew Bogott
Hello! The Cloud Services team is traveling quite a bit in the next few weeks: the Hackathon, the OpenStack Summit, and some personal travel.  There will always be at least one person available for emergencies, but please be patient if we're slow to respond to requests. Everyone should be ba

[Cloud] [Cloud-announce] Rolling VM reboots next Wednesday, 2018-06-06, beginning at 14:00 UTC

2018-05-30 Thread Andrew Bogott
As part of routine security maintenance, we'll be rebooting all VMs and virtualization hosts next Wednesday starting at 14:00 UTC (7AM San Francisco time). Toolforge users should be largely unaffected by this activity. Other projects (including deployment-prep) will experience sporadic downtim

Re: [Cloud] [Cloud-announce] Rolling VM reboots next Wednesday, 2018-06-06, beginning at 14:00 UTC

2018-06-05 Thread Andrew Bogott
Reminder:  These reboots will start in about 12 hours. On 5/30/18 10:46 AM, Andrew Bogott wrote: As part of routine security maintenance, we'll be rebooting all VMs and virtualization hosts next Wednesday starting at 14:00 UTC (7AM San Francisco time). Toolforge users should be la

Re: [Cloud] [Cloud-announce] Rolling VM reboots next Wednesday, 2018-06-06, beginning at 14:00 UTC (DONE)

2018-06-06 Thread Andrew Bogott
Bogott wrote: Reminder: These reboots will start in about 12 hours. On 5/30/18 10:46 AM, Andrew Bogott wrote: As part of routine security maintenance, we'll be rebooting all VMs and virtualization hosts next Wednesday starting at 14:00 UTC (7AM San Francisco time). Toolforge users should be la

[Cloud] [Cloud-announce] limited staff availability next week (2018-06-16 through 2018-06-25)

2018-06-14 Thread Andrew Bogott
Hello! Much of the Cloud Services staff will be traveling and attending meetings next week.  There will always be someone available for emergencies, but routine support requests may get handled more slowly than usual. Things will be back to normal the following Monday, the 25th. - Andrew +

[Cloud] [Cloud-announce] VPS users, please claim your projects

2018-08-13 Thread Andrew Bogott
We're drawing close to a painful migration event[1], during which we will (probably) have to copy VMs between hosts one project at a time, largely by hand.  For that reason, I'm feeling even stingier than usual about preserving unused and/or abandoned projects and instances. It's been a couple

[Cloud] [Cloud-announce] VPS users, please claim your projects (second notice)

2018-08-26 Thread Andrew Bogott
In an attempt to identify abandoned VPS projects, I've created a wiki page that lists all existing projects, here: https://wikitech.wikimedia.org/wiki/Cloud_VPS_2018_Purge Currently 85 projects[2] on that list are unclaimed. If you are a VPS user, please visit that page and mark any projects

[Cloud] [Cloud-announce] 57 VPS projects are candidates for deletion

2018-09-17 Thread Andrew Bogott
There are currently 57 unclaimed projects on https://wikitech.wikimedia.org/wiki/Cloud_VPS_2018_Purge.  I will start shutting down unclaimed projects at the beginning of next month, and those projects will be left behind in the future network migration[1] and, eventually, deleted. If you see

[Cloud] [Cloud-announce] Network maintenance Thursday, 2018-10-04 at 16:00 UTC

2018-09-28 Thread Andrew Bogott
Routine network upgrades are scheduled for Thursday which may result in brief WMCS service interruptions.  In particular, Wikitech and Horizon may stop working, and instance creation/deletion/updating may briefly fail. The network engineers have reserved a two-hour window beginning at 16:00 UT

[Cloud] [Cloud-announce] Reminder: Network maintenance Thursday, 2018-10-04 at 16:00 UTC

2018-10-04 Thread Andrew Bogott
18 13:19:36 -0500 From: Andrew Bogott Reply-To: andrewbog...@gmail.com To: cloud-annou...@lists.wikimedia.org Routine network upgrades are scheduled for Thursday which may result in brief WMCS service interruptions.  In particular, Wikitech and Horizon may stop working, and ins

[Cloud] [Cloud-announce] Network interruptions next Wednesday, 2018-10-24 at 14:00 UTC

2018-10-17 Thread Andrew Bogott
Hello! The Wikimedia datacenter team will be performing some routine network maintenance[1] next Wednesday.  This will cause brief, rolling network interruptions for essentially all tools, services, and virtual servers -- each physical server will be briefly unplugged as its network cable is

[Cloud] [Cloud-announce] Neutron region migrations starting next Friday

2018-10-19 Thread Andrew Bogott
Beginning next week, the cloud team will start migrating projects to Neutron[1] in earnest.  I will attempt to reach out individually to affected project admins as well, but here is the upcoming migration schedule: Friday, 2018-10-19:  analytics Monday, 2018-10-22:   antiharassment, catgra

Re: [Cloud] [Cloud-announce] Neutron region migrations starting next Friday

2018-10-19 Thread Andrew Bogott
, mwstake Thursday, 2018-11-01: planet, pluggableauth, privpol-captcha, qna Friday, 2018-11-02: reading-web-staging, suggestbot, test-twemproxy, wikibase-registry, wikibrain, wikidata-primary-sources-tool On 10/19/18 10:18 AM, Andrew Bogott wrote: Beginning next week, the cloud team will start

Re: [Cloud] [Cloud-announce] Network interruptions next Wednesday, 2018-10-24 at 14:00 UTC

2018-10-24 Thread Andrew Bogott
Reminder:  This maintenance is starting in about 15 minutes. On 10/17/18 9:18 AM, Andrew Bogott wrote: Hello! The Wikimedia datacenter team will be performing some routine network maintenance[1] next Wednesday.  This will cause brief, rolling network interruptions for essentially all tools

[Cloud] [Cloud-announce] RESCHEDULED Network interruptions next Wednesday, 2018-10-31 at 14:00 UTC

2018-10-24 Thread Andrew Bogott
UPDATE:  This maintenance has been postponed a week due to our datacenter engineer being injured and unable to complete the work.  This will be happening next Wednesday instead, at the same time as previously scheduled. On 10/24/18 8:44 AM, Andrew Bogott wrote: Reminder: This

[Cloud] [Cloud-announce] Neutron region migrations, round two

2018-10-29 Thread Andrew Bogott
Region migration is going smoothly, and it's time to plan out the next week of moves. For details about what's happening here, consult the link below[1]. Here is the schedule for the next week of moves: Monday, 2018-11-05: discovery-stats, globaleducation, hat-imagescalers, language Tues

Re: [Cloud] [Cloud-announce] RESCHEDULED Network interruptions next Wednesday, 2018-10-31 at 14:00 UTC

2018-10-31 Thread Andrew Bogott
Reminder:  This maintenance is happening in about 45 minutes.  If all goes well it should be quick and largely unnoticeable. On 10/24/18 8:57 AM, Andrew Bogott wrote: UPDATE: This maintenance has been postponed a week due to our datacenter engineer being injured and unable to complete the

Re: [Cloud] [Cloud-announce] RESCHEDULED Network interruptions next Wednesday, 2018-10-31 at 14:00 UTC

2018-10-31 Thread Andrew Bogott
On 10/31/18 8:16 AM, Andrew Bogott wrote: Reminder: This maintenance is happening in about 45 minutes.  If all goes well it should be quick and largely unnoticeable. All done!  Let me know if you encounter any network issues; things look good from my end. On 10/24/18 8:57 AM, Andrew

[Cloud] [Cloud-announce] quarry.wmflabs.org downtime starting in about an hour

2018-11-05 Thread Andrew Bogott
    We'll be shuffling the VMs that host the Quarry service over to a new corner of the cloud today.  During the move the service will be unavailable and/or behave erratically.     I don't expect the move to take more than an hour.  I'll send a further notice when things are done. -Andrew +

Re: [Cloud] [Cloud-announce] quarry.wmflabs.org downtime starting in about an hour {{Done}}

2018-11-05 Thread Andrew Bogott
On 11/5/18 11:10 AM, Andrew Bogott wrote: We'll be shuffling the VMs that host the Quarry service over to a new corner of the cloud today.  During the move the service will be unavailable and/or behave erratically. This is all done -- Quarry is back up and working.     I don't

[Cloud] [Cloud-announce] Neutron region migrations, round three

2018-11-05 Thread Andrew Bogott
It's Monday, which means it's time to schedule another round of project migrations. For more info about what this is, consult the link below[1]. Here is the schedule for the next week of moves: Monday, 2018-11-12: Holiday, no activity :) Tuesday, 2018-11-13: commtech, design, discourse, ger

[Cloud] [Cloud-announce] Neutron region migrations, round four

2018-11-15 Thread Andrew Bogott
Next week is a short week in the US, so no project moves will happen.  Here is the schedule for project moves in the following week: Monday, 2018-11-26: collection-alt-renderer, dumps, extdist, glampipe, google-api-proxy, hound, lizenzhinweisgenerator Tuesday, 2018-11-27: osmit, pagemigratio

[Cloud] [Cloud-announce] tools-bastion-02 aka tools-dev downtime on Tuesday

2018-11-17 Thread Andrew Bogott
Hello! I need to shut down the tools-dev host in order to move it to a different server.  The downtime will be brief, but in the meantime I recommend people move their work to a different bastion (e.g. tools-login.wmflabs.org) in order to avoid interruption. This will happen on or near 15:00

Re: [Cloud] [Cloud-announce] tools-bastion-02 aka tools-dev downtime on Tuesday

2018-11-20 Thread Andrew Bogott
Reminder:  This is happening in one hour. On 11/17/18 1:06 PM, Andrew Bogott wrote: Hello! I need to shut down the tools-dev host in order to move it to a different server.  The downtime will be brief, but in the meantime I recommend people move their work to a different bastion (e.g. tools

[Cloud] [Cloud-announce] Neutron region migrations, round five

2018-12-04 Thread Andrew Bogott
With any luck we'll have some more hardware installed by next week, so it's time to move more projects!  This is probably the last round of bulk moves; after this it's all special cases for which I'll contact people directly. Tuesday, 2018-12-11: maps, wm-bot Wednesday, 2018-12-12: mwoffline

[Cloud] [Cloud-announce] additional monitoring on cloudvirts -- don't run them empty!

2018-12-06 Thread Andrew Bogott
I recently noticed that some of our standard kvm/nova monitoring never got copied over from the labvirt puppet code to the cloudvirt puppet code.  Tomorrow I will merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/478113/ to fix that. Once that patch is merged, icinga will be a bit t

Re: [Cloud] [Cloud-announce] additional monitoring on cloudvirts -- don't run them empty!

2018-12-06 Thread Andrew Bogott
Sorry, all, this was meant for a different list.  Feel free to ignore! -A On 12/6/18 5:16 PM, Andrew Bogott wrote: I recently noticed that some of our standard kvm/nova monitoring never got copied over from the labvirt puppet code to the cloudvirt puppet code.  Tomorrow I will merge https

[Cloud] [Cloud-announce] Grid outage tomorrow, 16:00 UTC

2018-12-20 Thread Andrew Bogott
Tomorrow I'll be moving the grid engine master node to a new virt host.  That will cause a 15-minute outage during which new jobs (crons, or things submitted by hand) will fail. Existing jobs or webservices will be unaffected by the downtime. I'll start the move at 16:00 UTC on Friday, 2018-

Re: [Cloud] [Cloud-announce] Grid outage tomorrow, 16:00 UTC -- starting in 30!

2018-12-21 Thread Andrew Bogott
Reminder: this interruption will start in about 30 minutes. On 12/20/18 2:39 PM, Andrew Bogott wrote: Tomorrow I'll be moving the grid engine master node to a new virt host.  That will cause a 15-minute outage during which new jobs (crons, or things submitted by hand) will fail. Exi

Re: [Cloud] [Cloud-announce] Grid outage tomorrow, 16:00 UTC

2018-12-21 Thread Andrew Bogott
I'm starting this move now. On 12/21/18 9:32 AM, Andrew Bogott wrote: Reminder: this interruption will start in about 30 minutes. On 12/20/18 2:39 PM, Andrew Bogott wrote: Tomorrow I'll be moving the grid engine master node to a new virt host.  That will cause a 15-minute out

Re: [Cloud] [Cloud-announce] Grid outage tomorrow, 16:00 UTC

2018-12-21 Thread Andrew Bogott
This is done.  Please let us know if you encounter new/unexpected after-effects from this move. -Andrew On 12/21/18 10:01 AM, Andrew Bogott wrote: I'm starting this move now. On 12/21/18 9:32 AM, Andrew Bogott wrote: Reminder: this interruption will start in about 30 minutes. On 12/

[Cloud] [Cloud-announce] VPS bastions updated

2019-01-07 Thread Andrew Bogott
I've just moved the VPS bastions to new hosts.  These hosts are in the new network region, and are also running Debian Stretch. This should be an almost fully-transparent change for all users. On your first use of ssh you may see a warning like:    Warning: Permanently added the ECDSA host key

[Cloud] [Cloud-announce] VPS hardware failure -- downtime ongoing

2019-02-13 Thread Andrew Bogott
We're currently experiencing a mysterious hardware failure in our datacenter -- three different SSDs failed overnight, two of them in cloudvirt1018 and one of them in cloudvirt1024.  The VMs on 1018 are down entirely.  We may move those on 1024 to another host shortly in order to guard against

Re: [Cloud] [Cloud-announce] VPS hardware failure -- resolved for now

2019-02-13 Thread Andrew Bogott
this hardware anyway, out of an abundance of caution, but that's unlikely to produce further downtime.  With luck, this is the last you'll hear about this. -Andrew On 2/13/19 7:25 AM, Andrew Bogott wrote: We're currently experiencing a mysterious hardware failure in our datace

Re: [Cloud] [Cloud-announce] VPS hardware failure -- not yet resolved after all!

2019-02-13 Thread Andrew Bogott
I spoke too soon -- we're still working on this.  Most of these VMs will remain down in the meantime. Sorry for the outage! On 2/13/19 8:21 AM, Andrew Bogott wrote: We don't fully understand what happened, but after Giovanni performed a classic "turning it off and on again"

Re: [Cloud] [Cloud-announce] VPS hardware failure -- not yet resolved after all!

2019-02-13 Thread Andrew Bogott
access it then you're in luck!  If not, stay tuned. -Andrew On 2/13/19 9:15 AM, Andrew Bogott wrote: I spoke too soon -- we're still working on this.  Most of these VMs will remain down in the meantime. Sorry for the outage! On 2/13/19 8:21 AM, Andrew Bogott wrote: We don't

Re: [Cloud] [Cloud-announce] VPS hardware failure -- things are even worse!

2019-02-13 Thread Andrew Bogott
-85b7-37643f03bfea | wikidata-misc | wikidata-dev On 2/13/19 11:23 AM, Andrew Bogott wrote: Here's the latest: cloudvirt1018 is up and running, and many of its VMs are fine. Many other VMs are corrupted and won't start up.  Some of those VMs will probably be los

Re: [Cloud] [Cloud-announce] VPS hardware failure -- aftermath

2019-02-13 Thread Andrew Bogott
Arturo will appear there in a few hours. -Andrew On 2/13/19 1:50 PM, Andrew Bogott wrote: Now cloudvirt1024 is dying in earnest, so VMs hosted there will be down for a while as well.  This is, as far as anyone can tell, just a stupid coincidence. So far it appears that we are going to be ab

[Cloud] [Cloud-announce] Poor toolforge and PAWS performance for (at least) the next 24 hours

2019-02-14 Thread Andrew Bogott
Because bad things come in threes (I'm hoping it's threes and not sevens) the server that hosts toolsdb is now also misbehaving. Brooke just now disabled a troubled drive which may have resolved things, but if the last few hours are any indication then the vast majority of connection or query a

[Cloud] [Cloud-announce] Disabling creation of new Jessie images

2019-03-18 Thread Andrew Bogott
tl;dr: We're about to disable self-service creation of Debian Jessie VMs.  To request an exception, open a Phabricator ticket specifying your need and reasons. -- We're close to polishing off the last few Ubuntu Trusty VMs in the cloud, which means it's time to start thinking about the upcomi

[Cloud] [Cloud-announce] Horizon logins currently broken

2019-03-19 Thread Andrew Bogott
Good morning! As a side-effect of our response to the current gerrit vandalism epidemic, the 2fa integration between Horizon and Wikitech has been disabled.  That means that existing Horizon sessions are still valid but fresh logins will fail. This problem is being actively worked on.  In th

Re: [Cloud] [Cloud-announce] Horizon logins currently broken -- FIXED

2019-03-22 Thread Andrew Bogott
This issue is resolved now, and Horizon should work as usual.  Sorry for the interruption! On 3/19/19 9:54 AM, Andrew Bogott wrote: Good morning! As a side-effect of our response to the current gerrit vandalism epidemic, the 2fa integration between Horizon and Wikitech has been disabled

[Cloud] [Cloud-announce] Partial toolforge and PAWS outages Monday

2019-04-11 Thread Andrew Bogott
Tuesday starting at around 17:00 UTC I'm going to relocate the paws and kubernetes masters to the new network region.  While the VMs are copying, launches of new kubernetes jobs and creation of new PAWS notebooks will fail. The outage should last about an hour -- less if everything goes well,

Re: [Cloud] [Cloud-announce] Partial toolforge and PAWS outages TUESDAY

2019-04-11 Thread Andrew Bogott
My apologies, the earlier version of this email had an incorrect subject line.  This outage will be happening on Tuesday, not Monday. -Andrew On 4/11/19 8:30 PM, Andrew Bogott wrote: Tuesday starting at around 17:00 UTC I'm going to relocate the paws and kubernetes masters to the new ne

Re: [Cloud] [Cloud-announce] Partial toolforge and PAWS outages TUESDAY

2019-04-16 Thread Andrew Bogott
Reminder:  this is happening today, in about three hours. -Andrew On 4/11/19 8:30 PM, Andrew Bogott wrote: Tuesday starting at around 17:00 UTC I'm going to relocate the paws and kubernetes masters to the new network region.  While the VMs are copying, launches of new kubernetes job

Re: [Cloud] [Cloud-announce] [Wikitech-l] Change to Wikitech logins: Username now case-sensitive

2019-04-16 Thread Andrew Bogott
On 4/16/19 7:59 AM, Andrew Otto wrote: Great! Is this just for Wikitech itself or all ldap/wikitech authentication? This notice is related to a change in mediawiki code, so concerns direct logins to wikitech itself.  That said, the 2fa key used by Horizon is stored in the wikitech database

Re: [Cloud] [Cloud-announce] Partial toolforge and PAWS outages TUESDAY

2019-04-16 Thread Andrew Bogott
This work is still underway.  There are some unforeseen issues but we should be back to normal shortly. On 4/16/19 9:04 AM, Andrew Bogott wrote: Reminder: this is happening today, in about three hours. -Andrew On 4/11/19 8:30 PM, Andrew Bogott wrote: Tuesday starting at around 17:00 UTC

Re: [Cloud] [Cloud-announce] Partial toolforge and PAWS outages TUESDAY

2019-04-16 Thread Andrew Bogott
This is done now.  Paws broke in a thousand ways after the move so it lagged well behind the expected timeline, but normal function of the toolforge k8s grid and Paws should be restored. Let us know if you run into unexpected issues. -Andrew On 4/16/19 1:03 PM, Andrew Bogott wrote: This

[Cloud] [Cloud-announce] Debian Buster image now available in cloud-vps

2019-07-09 Thread Andrew Bogott
The latest Debian version, 10.0 "buster", was officially released a few days ago[0].  Today, I've built a new Debian buster base image and made it available in all projects. The Stretch base image will remain available for some time to permit compatibility with existing setups, but any

[Cloud] [Cloud-announce] Instance downtime on August 22nd and 23rd

2019-07-10 Thread Andrew Bogott
As part of routine networking and OS upgrades, I'll be emptying two hypervisors (cloudvirt1016 and cloudvirt1017) on Monday and Tuesday, the 22nd and 23rd.  This will result in downtime for many VMs as they are copied and restarted.  A complete list of affected instances follows. I'll

Re: [Cloud] [Cloud-announce] Instance downtime on JULY 22nd and 23rd

2019-07-10 Thread Andrew Bogott
Apologies! This is happening in July rather than August -- about 12 days from now. On 7/10/19 2:24 PM, Andrew Bogott wrote: As part of routine networking and OS upgrades, I'll be emptying two hypervisors (cloudvirt1016 and cloudvirt1017) on Monday and Tuesday, the 22nd and 23rd.  This

[Cloud] [Cloud-announce] Brief toolforge cron outage on Friday, 2019-07-19 at 14:00 UTC

2019-07-17 Thread Andrew Bogott
On Friday I'll be moving the toolforge cron server to new hardware. During the move, any uses of the 'crontab' command will fail gracelessly.  Any cron jobs scheduled to launch during the downtime will be skipped. The move should take 5-10 minutes but may take as long as 30 if there are compl

Re: [Cloud] [Cloud-announce] Brief toolforge cron outage on Friday, 2019-07-19 at 14:00 UTC -- DONE

2019-07-19 Thread Andrew Bogott
This is done. On 7/17/19 3:25 PM, Andrew Bogott wrote: On Friday I'll be moving the toolforge cron server to new hardware. During the move, any uses of the 'crontab' command will fail gracelessly.  Any cron jobs scheduled to launch during the downtime will be skipped. The mov

Re: [Cloud] [Cloud-announce] Instance downtime on JULY 22nd and 23rd

2019-07-22 Thread Andrew Bogott
Reminder:  This is happening today, starting right now. -Andrew On 7/10/19 2:24 PM, Andrew Bogott wrote: As part of routine networking and OS upgrades, I'll be emptying two hypervisors (cloudvirt1016 and cloudvirt1017) on Monday and Tuesday, the 22nd and 23rd.  This will result in dow

Re: [Cloud] [Cloud-announce] Instance downtime on JULY 22nd and 23rd DONE

2019-07-22 Thread Andrew Bogott
Thanks to 10GbE, this went quite a bit faster than I expected and is now done.  I've confirmed that all affected VMs are up and reachable, but please let me know if you encounter any unexpected problems from the move. -Andrew On 7/22/19 8:30 AM, Andrew Bogott wrote: Reminder: Th

[Cloud] [Cloud-announce] Instance downtime on August 5th and 6th

2019-07-31 Thread Andrew Bogott
As part of routine networking and OS upgrades, I'll be emptying two more hypervisors (cloudvirt1021 and cloudvirt1022) on Monday and Tuesday next week, the 5th and 6th.  This will result in downtime for many VMs as they are copied and restarted.  A complete list of affected instances follow

Re: [Cloud] [Cloud-announce] Instance downtime on August 5th and 6th

2019-07-31 Thread Andrew Bogott
55 AM, Andrew Bogott wrote: As part of routine networking and OS upgrades, I'll be emptying two more hypervisors (cloudvirt1021 and cloudvirt1022) on Monday and Tuesday next week, the 5th and 6th.  This will result in downtime for many VMs as they are copied and restarted.  A comple

Re: [Cloud] [Cloud-announce] Instance downtime on August 5th and 6th

2019-07-31 Thread Andrew Bogott
a pre-arranged window, though, if there's sometime that's better for you.  Over the weekend is possible, even. -A Cyberpower678 English Wikipedia Account Creation Team English Wikipedia Administrator Global User Renamer On Jul 31, 2019, at 10:26, Andrew Bogott <mailto:abog...@wik

Re: [Cloud] [Cloud-announce] Instance downtime on August 5th and 6th

2019-07-31 Thread Andrew Bogott
count Creation Team English Wikipedia Administrator Global User Renamer On Jul 31, 2019, at 10:56, Andrew Bogott <mailto:abog...@wikimedia.org>> wrote: On 7/31/19 9:46 AM, Maximilian Doerr wrote: Oh please no.  cyberbot-db-01 needs to remain up in the next two weeks at least. Postponing isn'

[Cloud] [Cloud-announce] Brief toolforge grid interruption on Monday, August 5 at 14:00UTC

2019-07-31 Thread Andrew Bogott
I will be moving the toolforge grid master on Monday.  That will mean that for a few minutes it will be impossible to submit new grid jobs.  Jobs that are already running will be unaffected. I'll make the move at 14:00 UTC, which is about 7 AM Pacific time. -Andrew

Re: [Cloud] [Cloud-announce] Instance downtime on August 5th and 6th

2019-08-05 Thread Andrew Bogott
Reminder:  This is happening today, starting in a few minutes. -Andrew On 7/31/19 9:26 AM, Andrew Bogott wrote: In the interest of finishing up this stage of upgrades, I'm going to try to also drain cloudvirt1023 during the same window.  That includes the following additional VMs: ser

Re: [Cloud] [Cloud-announce] Instance downtime on August 5th and 6th -- DONE

2019-08-05 Thread Andrew Bogott
These VMs have all finished copying. Please let me know if you see any ongoing problems that result from the move. -Andrew On 8/5/19 7:40 AM, Andrew Bogott wrote: Reminder: This is happening today, starting in a few minutes. -Andrew On 7/31/19 9:26 AM, Andrew Bogott wrote: In the interest

[Cloud] [Cloud-announce] Puppetmaster changes today

2019-09-09 Thread Andrew Bogott
Later today (starting in a few hours around 18:00 UTC) we'll be rearranging the puppetmaster setup for most cloud VMs[0].  No tools or services (other than puppet) should be affected, but some of you might get grumpy emails about broken puppet runs during the transition, which I encourage you t

Re: [Cloud] [Cloud-announce] Puppetmaster changes today

2019-09-09 Thread Andrew Bogott
The user-facing parts of this are all done now.  New VM creation was broken for a few hours but should be working properly now. -Andrew On 9/9/19 10:09 AM, Andrew Bogott wrote: Later today (starting in a few hours around 18:00 UTC) we'll be rearranging the puppetmaster setup for most
